DOCUMENT RESUME 

ED 342 761 SP 033 650 



AUTHOR Estes, Gary D.; And Others

TITLE Assessment Component of the California New Teacher
Project: Second Year Technical Report. Volume 1.

INSTITUTION Far West Lab. for Educational Research and
Development, San Francisco, Calif.

SPONS AGENCY California Commission of Teacher Credentialing,
Sacramento.; California State Dept. of Education,
Sacramento.

PUB DATE Feb 91

CONTRACT TCC-8023

NOTE 435p.; For the first-year report, see ED 323 197.

PUB TYPE Reports - Evaluative/Feasibility (142)

EDRS PRICE MF01/PC18 Plus Postage.

DESCRIPTORS *Beginning Teacher Induction; Elementary Secondary
Education; *Evaluation Criteria; *Evaluation Methods;
Higher Education; *Measurement Techniques; Preservice
Teacher Education; Teacher Certification; *Teacher
Evaluation

IDENTIFIERS California New Teacher Project; *New Teachers; Reform
Efforts



ABSTRACT 

This component of the California New Teacher Project 
describes comprehensive, innovative forms of assessment to be 
included in new teacher preparation and the credentialing process in 
order to promote successful induction of teachers into the 
profession. This report, organized into chapters, begins with an 
introduction that analyzes the administration and scoring of 
assessment instruments pilot-tested in 1990. The next chapter 
describes the pilot test design and processes used to evaluate the 
assessment approaches. The subsequent chapters present the assessment 
instruments in the following order: Secondary Life/General Science 
Teacher Assessment, Laboratory Science Assessment, Assessment of 
Competence in Monitoring Students in the Classroom, Language Arts 
Pedagogical Knowledge Assessment, and Secondary English Assessment. 
The report concludes with a discussion of ease of administration, 
scoring, content and format, costs, and technical qualities. The 
final chapter includes: a summary of the strengths and weaknesses of 
the assessment approaches represented by these instruments; 
conclusions about the effective design of training for assessors 
and/or scorers; and an identification of policy issues. Extensive 
lists of tables, figures, and appendices complete the document. 
(LL) 






ASSESSMENT COMPONENT OF THE 



CALIFORNIA NEW TEACHER PROJECT: 



SECOND YEAR TECHNICAL REPORT 



VOLUME I 



FEBRUARY 1991 




Far West Laboratory

730 Harrison Street
San Francisco, CA 94107-1242



This report is based upon work performed pursuant to Contract TCC-8023 with the
California Interagency Task Force of the California Commission on Teacher
Credentialing and the California State Department of Education New Teacher Project.
The contents do not necessarily reflect the views or policies of these agencies, and the
accuracy of the information is the sole responsibility of the Far West Laboratory.



ASSESSMENT COMPONENT OF THE 
CALIFORNIA NEW TEACHER PROJECT: 
SECOND YEAR TECHNICAL REPORT 



VOLUME I 



Far West Laboratory for Educational 
Research and Development 

Gary D. Estes 
Kendyll Stansbury 
Claudia Long 
Kenneth Wolf 



February 1991 




TABLE OF CONTENTS



CHAPTER 1: INTRODUCTION 1.1 

Research on New and Experienced Teachers 1.2 

Support and Assessment of New Teachers in California 1.3 

California Teacher Credentialing Reforms 1.4 

The California New Teacher Project 1.6 

Assessment Component of the California New Teacher Project 1.6 

Pilot Testing in Spring and Summer, 1990 1.10 

CHAPTER 2: PILOT TEST DESIGN AND ANALYSIS 2.1 

Design of Pilot Tests 2.1 

Sources of Instrumentation 2.1 

Sampling Plans 2.2 

Data Collection 2.4 

Data Reduction 2.5 

Overview of Analytic Categories 2.6 

Administration of Assessment 2.6 

Assessment Content 2.6 

Assessment Format 2.7 

Cost Analysis 2.7 

Technical Quality 2.7 

CHAPTER 3: SECONDARY LIFE/GENERAL SCIENCE TEACHER ASSESSMENT . . 3.1 

Administration of Assessment 3.3 

Overview 3.3 

Logistics 3.5 

Identifying teacher samples 3.5 

Sending orientation materials 3.5 







Security 3.6 

Assessors and Their Training 3.7

Teacher and Assessor Impressions

of Administration 3.7

Scoring 3.7

Scoring Process 3.7

Scorers and Their Training 3.10

Assessment Content 3.12

Congruence with California Model Curriculum

Guides and Frameworks 3.13

Extent of Coverage of California Standards

for Beginning Teachers 3.20

Job-Relatedness 3.24

Teacher perceptions 3.24

Scorer perceptions 3.26

Appropriateness for Beginning Teachers 3.27

Perceptions 3.27

Performance on assessment tasks 3.30

Appropriateness across Contexts 3.33

Grade level 3.33

Diverse students 3.34

Fairness across Groups of Teachers 3.37

Appropriateness as a Method of Assessment 3.38

Comparison with other assessments 3.40

Assessment Format

Format Features 3.42

Clarity of Preparatory Materials 3.42

Clarity of Task Instructions 3.44

Length of Tasks 3.47

Clarity of Scoring Criteria and Procedures 3.47




Cost Analysis 3.48 

Administration and Scoring Costs Estimate 3.48 

Development and Pilot Testing Costs 3.49 

Technical Quality 3.49 

Reliability 3.51 

Interrater agreements 3.51 

Internal consistency of the tasks 

and assessment 3.52 

Validity of Agreement Through Group Comparisons 3.54 

Content validity 3.54 

Conclusions and Recommendations 3.57 

Administration of Assessment 3.57 

Scoring 3.58 

Assessment Content 3.59 

Assessment Format 3.60 

Summary 3.61

CHAPTER 4: SCIENCE LABORATORY ASSESSMENT 4.1 

Administration of Assessment 4.5 

Overview 4.5 

Logistics 4.6 

Recruiting and training observers 4.6 

Identifying the teacher sample 4.6 

Scheduling the observations 4.8 

Sending orientation materials 4.8 

Collecting evaluation feedback 4.8 

Security 4.8 

Assessors and Their Training 4.9 

Characteristics of the assessors 4.9 

Training 4.9

Perceptions of training 4.10 

Scoring 4.11 

Teacher, Assessors, and FWL Staff Perceptions 

of Administration 4.12 




Assessment Content 4.14

Congruence with the 1990 California

Science Framework 4.15

Extent of Coverage of California Standards

for Beginning Teachers 4.17

Job Relatedness 4.22

Appropriateness for Beginning Teachers 4.22

Perceptions 4.24

Performance on assessment 4.24

Appropriateness across Contexts 4.30

Across grade levels 4.30

Diverse students 4.32

Fairness across Groups of Teachers 4.35

Appropriateness as a Method of Assessment 4.35

Appropriateness 4.35

Comparison 4.36

Assessment Format 4.37

Clarity of the Teachers' Preparation Materials 4.38

Clarity of the Conference Questions 4.40

Clarity of the Forms and Process for

Documentation and Rating 4.41

Guided note-taking form 4.44

Documentation sorting record 4.46

Summary report form 4.48

Cost Analysis

Administration and Scoring Cost Estimates 4.52

Assessor time and costs 4.52

Training costs for assessors 4.53

Other costs 4.53

Development and Pilot Testing Costs 4.53

Cost Summary 4.54




Technical Quality 4.54 

Development 4.54 

Reliability 4.56 

Conclusions and Recommendations 4.57 

Administration of Assessment 4.57 

Assessment Content 4.59 

Summary 4.62 

CHAPTER 5: ASSESSMENT OF COMPETENCE IN MONITORING STUDENT 

ACHIEVEMENT IN THE CLASSROOM 5.1 

Administration of Assessments 5.3 

Overview 5.3 

Logistics 5.3 

Identifying teacher samples 5.6 

Orientation materials 5.6

Conducting the assessment 5.6 

Conducting staff development 5.6 

Obtaining feedback from the teachers 5.6 

Assessors and Their Training 5.7 

Teacher Impressions of Administration 5.7 

Scoring 5.8 

Scoring Process 5.8 

Scorers and Their Training 5.9 

Scorers 5.9 

Training of scorers 5.10 

Perceptions of training 5.11 

Assessment Content 5.11 

Congruence with California Model Curriculum 

Guides and Frameworks 5.12 

Language arts 5.13 

Science 5.14 

Social science 5.15 

Mathematics 5.16 




Extent of Coverage of California Standards 

for Beginning Teachers 5.17 

Job-Relatedness 5.18 

Teacher perceptions 5.18 

Scorer perceptions 5.19 

Appropriateness for Beginning Teachers 5.19 

Teacher perceptions 5.19 

Scorer perceptions 5.22 

Performance on assessment tasks 5.22 

Appropriateness across Contexts 5.24 

Grade level and subject matter 5.24 

Diverse students 5.26 

Fairness across Groups of Teachers 5.29 

Appropriateness as a Method of Assessment 5.29 

Teacher perceptions 5.30 

Scorer perceptions 5.30 

Comparison with other assessments 5.31 

Assessment Format 5.32 

Format Features 5.32 

Clarity of Assessment 5.32 

Clarity of questions 5.32 

Clarity of scoring criteria 5.34 

Evaluation of staff development training 5.35 

Cost Analysis 5.36 

Administration and Scoring Costs 5.36 

Development and Pilot Testing Costs 5.37 

Technical Quality 5.39 

Development 5.39 

Reliability 5.40 

Interrater agreements 5.40 

Interrater correlations 5.40 

Internal consistency of the assessment forms 5.40 


Validity of Agreement through Group Comparisons 5.43 

Content validity 5.44 

Conclusions and Recommendations 5.46 

Administration of Assessment 5.46 

Scoring 5.46 

Assessment Content 5.47

Assessment Format 5.49 

Summary 5.49 

CHAPTER 6: LANGUAGE ARTS PEDAGOGICAL KNOWLEDGE ASSESSMENT 6.1

Administration of Assessments 6.4 

Overview 6.4 

Logistics 6.6 

Identifying teacher samples 6.6 

Sending orientation materials 6.6 

Assessment administration 6.6 

Collecting evaluation feedback 6.7 

Security 6.7 

Scoring 6.8 

Scoring Process 6.8 

Scorers and Their Training 6.9 

Scoring characteristics 6.9 

Training 6.9 

Perceptions of training 6.10

Teacher and FWL Staff Impressions

of Administration 6.11

Assessment Content 6.11 

Congruence with California Curriculum 

Guides and Frameworks 6.12 

Extent of Coverage of California Standards 

for Beginning Teachers 6.16 

Job-Relatedness 6.19 




Appropriateness for Beginning Teachers 6.19

Perceptions 6.19 

Performance on assessment 6.22 

Appropriateness across Contexts 6.24 

Grade level 6.24

Diverse students 6.24 

Fairness across Groups of Teachers 6.27 

Appropriateness as a Method of Assessment 6.28 

Comparison with other assessments 6.29 

Assessment Format 6.30

Clarity of Teacher Preparation Materials 6.30 

Clarity of Task Materials 6.31 

Suggestions for improving the 

task directions 6.31 

Suggestions for improving 

the questions 6.32 

Other suggestions for improving 

the materials 6.32 

Clarity of the Scoring Criteria and Procedures 6.33 

Cost Analysis 6.36 

Administration and Scoring Cost Estimates 6.36 

Development and Pilot Testing Costs 6.37 

Technical Quality 6.37 

Development 6.37

Reliability 6.39 

Interrater agreements 6.39 

Interrater correlations 6.41 

Validity of Agreement Through Group Comparisons 6.42 

Content Validity 6.42 


Conclusions and Recommendations 6.43

Administration of Assessment 6.43 

Assessment Content 6.45 

Assessment Format 6.47 

Summary 6.48 

CHAPTER 7: SECONDARY ENGLISH ASSESSMENT 7.1

Administration of Assessment Center Activities 7.4 

Overview 7.4 

Logistics 7.5 

Identifying the teacher sample 7.5 

Recruiting and training of assessors 7.7 

Scheduling/Arranging the assessments 7.7 

Developing and sending the orientation 

materials 7.7

Collecting evaluation feedback 7.8 

Security 7.8 

Assessors and Their Training 7.9 

Characteristics of the assessors 7.9

Training 7.9 

Perceptions of training 7.10 

Scoring 7.11 

Teacher, Assessor, and FWL Staff Perceptions 

of Administration 7.12 

Assessment Content 7.13 

Congruence with the California English/Language 

Arts Framework and Handbooks 7.14

Extent of Coverage of California Standards 

for Beginning Teachers 7.17 

Job-relatedness 7.19 

Appropriateness for Beginning Teachers 7.22 

Perceptions 7.22 

Performance on assessment 7.25

Appropriateness across Contexts 7.27

Grade level 7.28 

Diverse students 7.28 


Fairness across Groups of Teachers 7.30 

Appropriateness as a Method of Assessment 7.32 

Appropriateness 7.33 

Comparison of activities with other 

assessments 

Assessment Format 7.36

Clarity of the Teachers' Preparation Materials 7.36

Appropriateness of Time Allotted for Each

Activity 7.40

Clarity of the Rating Forms and Process 7.43

Cost Analysis

Administration and Scoring Cost Estimates 7.51 

Development and Pilot Testing Costs 7.52 

Technical Quality 7.52 

Development 7.52 

Reliability 7.54 

Interrater agreements 7.54 

Interrater correlations 7.56 

Internal consistency of the tasks 

and assessment 7.58 

Intercorrelations among activities 7.59

Validity of Agreement Through Group Comparisons 7.59 

Content validity 7.62 

Conclusions and Recommendations 7.62 

Administration of Assessment 7.62 

Assessment Content 7.63 

Assessment Format 7.65 

Summary 7.68 




CHAPTER 8: CONCLUSIONS 8.1 

Assessment Approaches 8.1 

Structured Simulation Tasks 8.1 

Definition 8.1 

Characteristics of instruments piloted 8.1 

Strengths and weaknesses 8.2 

Classroom Observations (Subject Matter Focus) 8.4 

Definition 8.4 

Characteristics of instrument piloted 8.4 

Strengths and weaknesses 8.5 

Videotaped Teaching Episodes 8.6 

Definition 8.6 

Characteristics of instrument piloted 8.6 

Strengths and weaknesses 8.7 

Performance-Based Assessment Center Exercises 8.8 

Definition 8.8 

Characteristics of instrument piloted 8.8 

Strengths and weaknesses 8.8 

Guidelines for the Design of Training 8.10 

Cost Estimates 8.11 

Policy Issues 8.12 

BIBLIOGRAPHY B.1




TABLES

TABLE 3.1 Pilot Test Participants: Secondary Life/General Science
Teacher Assessment 3.4

TABLE 3.2 Coverage of the California Science Framework by the
Secondary Life/General Science Teacher Assessment 3.14

TABLE 3.3 Extent of Coverage by the Secondary Life/General
Science Teacher Assessment of California Standards for
Beginning Teachers 3.25

TABLE 3.4 Teacher Performance, by Subpart, on the Secondary Life/
General Science Teacher Assessment 3.31

TABLE 3.5 Teacher Perceptions of the Preparatory Materials for the
Secondary Life/General Science Teacher Assessment 3.43

TABLE 3.6 Teacher Perceptions of the Clarity of Task Instructions for
the Secondary Life/General Science Teacher Assessment 3.45

TABLE 3.7 Developmental and Pilot Test Costs for the Secondary Life/
General Science Teacher Assessment 3.50

TABLE 3.8 Internal Consistency of Tasks: Secondary Life/General Science
Teacher Assessment 3.53

TABLE 3.9 Intercorrelations Among Tasks: Secondary Life/General Science
Teacher Assessment 3.55

TABLE 3.10 Trends of Mean Differences in Task Performance Between Candidates
with Different Characteristics: Secondary Life/General Science
Teacher Assessment 3.56

TABLE 4.1 Pilot Test Participants: Science Laboratory Assessment 4.7

TABLE 4.2 Coverage of the California Science Framework by the Science
Laboratory Assessment

TABLE 4.3 Extent of Coverage by the Science Laboratory Assessment of
California Standards for Beginning Teachers 4.23

TABLE 4.4 Suggested Changes to Pre-Observation Conference Questions:
Science Laboratory Assessment 4.42

TABLE 4.5 Developmental and Pilot Test Costs for the Science
Laboratory Assessment 4.55




TABLE 5.1 Participation in Staff Development Activities by Participation 

in Pre- or Post-Tests: Assessment of Competence in Monitoring 

Student Achievement in the Classroom 5.4 

TABLE 5.2 Pilot Test Participants: Assessment of Competence in Monitoring 

Student Achievement in the Classroom 5.5 

TABLE 5.3 Topics of Exercises Reported Being Too Difficult: Assessment of 

Competence in Monitoring Student Achievement in the Classroom 5.21 

TABLE 5.4 Teacher Performance by Form, Pre- or Post-Test, and Participation 
in Staff Development: Assessment of Competence in Monitoring 
Student Achievement in the Classroom 5.23 

TABLE 5.5 Distribution of Scenarios in Exercises Across Grade Levels: 

Assessment of Competence in Monitoring Student Achievement in 

the Classroom 5.25 

TABLE 5.6 Developmental and Pilot Test Costs for the Assessment of 

Competence in Monitoring Student Achievement in the Classroom 5.38 

TABLE 5.7 Correlations Between Paired Ratings for the Assessment of 

Competence in Monitoring Student Achievement in the Classroom 5.42 

TABLE 5.8 Trends of Mean Differences in Performance Between Candidates 
with Different Characteristics: Assessment of Competence in 
Monitoring Student Achievement in the Classroom 5.45 

TABLE 6.1 Some Characteristics of the Four LAPKA Scenarios for Assessing 

a Teacher's Pedagogical Content Knowledge in Language Arts

TABLE 6.2 Pilot Test Participants: Language Arts Pedagogical Knowledge 

Assessment (LAPKA) 6.5 

TABLE 6.3 Congruence of the Language Arts Pedagogical Knowledge Assessment 
(LAPKA) with the English-Language Arts Model Curriculum Guide for 
Kindergarten through Grade Eight 6.15 

TABLE 6.4 Extent of Coverage by the Language Arts Pedagogical Knowledge 
Assessment (LAPKA) of the California Standards for Beginning 
Teachers 6.20 

TABLE 6.5 Performance Data for Pilot Test Teachers (N=42) for the Language 

Arts Pedagogical Knowledge Assessment (LAPKA) 6.23 

TABLE 6.6 Developmental and Pilot Test Costs for the Language Arts 

Pedagogical Knowledge Assessment (LAPKA) 6.38 




TABLE 7.1 Pilot Test Participants: Secondary English Assessment 7.6 

TABLE 7.2 Congruence of the Secondary English Assessment with the 

English-Language Arts Framework and Handbooks 7.16 

TABLE 7.3 Extent of Coverage by the Secondary English Assessment 

of California Standards for Beginning Teachers 7.20 

TABLE 7.4 The Number of Teachers Receiving Each Rating in the 

Evaluation Categories for Each Activity 7.26 

TABLE 7.5 Developmental and Pilot Test Costs for the Secondary 

English Assessment 7.53 

TABLE 7.6 Correlations Between Raters for the Secondary English 
Assessment Activities for Holistic Rating (RT) and 

Summed Ratings (S) 7.57

TABLE 7.7 Trends of Mean Differences Between Candidates with 
Different Characteristics for Activities and Evaluation 
Categories

TABLE Statistical Comparison of Teacher Performance on the 

Secondary Life/General Science Teacher Assessment A.1

TABLE Statistical Comparison of Group Performance on the Assessment 

of Competence in Monitoring Student Achievement C.1

TABLE Statistical Comparison of Teacher Performance on the Language 

Arts Pedagogical Knowledge Assessment E.1 

TABLE Statistical Comparison of Teacher Performance on the Secondary 

English Assessment F.1

FIGURES

FIGURE 4.1 List of Domains and Elements: Science Laboratory Assessment 4.2 

FIGURE 4.2 Three Elements and Defining Indicators of the Materials/ 

Equipment Domain: Science Laboratory Assessment 4.3

FIGURE 4.3 Percent of Teachers Receiving a "2" Rating on Each Domain 4.25

FIGURE 4.4 Domains Teachers Believe Could Only Be Passed W/ 2 Yrs.

Classroom Experience 4.28

FIGURE 4.5 Domains Teachers Believe Could Be Passed Immediately After

Student Teaching 4.29


FIGURE 4.6 Percent of Teachers By Grade Level Receiving a "Two" Rating 

on Each Domain 4.31 

FIGURE 5.1 Rating Differences Between Scorers 5.41

FIGURE 6.1 % Agreement of Raters for the Language Arts Pedagogical 

Knowledge Assessment 6.40 

FIGURE 7.1 Response Form A: Responding to Student Writing 7.3 

FIGURE 7.2 Percent Agreement Between Raters for the Secondary 

English Assessment Activities 7.55

APPENDICES 

APPENDIX A: Statistical Comparison of Teacher Performance on the 

Secondary Life/General Science Teacher Assessment A.1

APPENDIX B: Science Laboratory Assessment: Content and Forms B.1

APPENDIX C: Statistical Comparison of Group Performance on the 

Assessment of Competence in Monitoring Student Achievement C.1

APPENDIX D: An Example of a Scoring Sheet for the Language Arts 

Pedagogical Knowledge Assessment D.1

APPENDIX E: Statistical Comparison of Teacher Performance on the 

Language Arts Pedagogical Knowledge Assessment E.1

APPENDIX F: Statistical Comparison of Teacher Performance on the Secondary 

English Assessment F.1




CHAPTER 1: 
INTRODUCTION 




Recent analyses of education in the United States have identified significant areas of 
ineffectiveness (Boyer, 1983; Goodlad, 1984; President's Commission for Excellence in 
Education, 1983), and have included important proposals for reform (Holmes Group, 1986; 
Shulman, 1987; Carnegie Corporation, 1986). Many of these analyses and proposals have 
addressed the quality of the teaching force, with particular focus on the preparation, 
support, and credentialing of new teachers. Some of the leading educational scholars in the 
nation have concluded that the standards for membership in the teaching profession are 
inadequate, that few states require beginning teachers to attain acceptable levels of 
competence in classroom teaching, and that the quality of instruction in the public schools 
suffers as a result of fragmented approaches to teacher preparation, certification, induction 
and career development. 

In the growing literature on educational reform, the emphasis on new teachers has 
been part of a broader discussion of promoting teaching as a profession (e.g., Wise and 
Darling-Hammond, 1987; Shulman and Sykes, 1986). Several leading advocates of 
educational reform have examined the standards of other professions (medicine, law, 
architecture, engineering, accountancy, etc.), and have argued that more rigorous and 
comprehensive assessments of teachers' knowledge and competence should be developed and 
adopted (Holmes Group, 1986; Shulman, 1987; Carnegie Corporation, 1986). After 
examining the histories of several professions, these researchers have suggested that the 
stature of a profession depends in part on the extent to which it verifies the professional 
knowledge and competence of each member in a broad assessment that must be passed in 
order to practice the profession in each state. In a series of reports that have been widely 
acclaimed, the critics recommended the development of stronger assessments as a way of 
strengthening instruction in the schools as well as public confidence in the teaching 
profession. 

These same reform advocates have also recommended the creation of stronger 
support systems for beginning teachers in the schools. The practice of giving the most 
difficult teaching assignments to new teachers is increasingly identified as a major cause of 
the high rate at which novices leave teaching (Griffin and Millies, 1986; Ward, 1991; Odell, 
1986; Huling-Austin, 1986; Ryan, 1980). Similarly, the historically weak systems of 
supervision, evaluation, and assistance for beginning teachers have been based on a 
mistaken assumption that the completion of teacher preparation programs in universities 
provides a "complete" basis for the successful practice of school teaching. In fact, 
prospective teachers need structures to further their preparation in ways that bring together 
the elements of undergraduate preparation, teacher education, student teaching, and initial 
classroom work in a comprehensive way. This need is particularly vital in California in the 
1990s, where the diversity of backgrounds, languages, and academic preparation makes 
teaching especially challenging. Recent analyses of these conditions have suggested that 
comprehensive support and assessment systems must be added to the new teacher 
preparation and credentialing process in order to promote the successful induction of 
teachers into an effective and respected profession. 

Research on New and Experienced Teachers 

The educational reform efforts in California and across the nation have been 
motivated, in part, by the literature which identifies the technical, socioemotional, and 
institutional needs of new teachers, and explores the differences between new and 
experienced teachers. New teachers, for example, report significant difficulties with the 
technical aspects of teaching, including classroom management (Veenman, 1984), curriculum 
implementation (Grant and Zeichner, 1981; Veenman, 1984; Berliner et al., 1987), and 
managing diversity within the classroom (Grant and Zeichner, 1981; Veenman, 1984; Borko 
et al., 1986; Berliner et al., 1987; Berliner et al., 1988). 

Socioemotionally, many new teachers experience insecurity, self-doubts, and 
substantial stress as they face the problems of acquiring and developing materials, lesson 
plans and tests without the expertise and materials that seasoned teachers draw upon. The 
typically brief period of supervised practice before assuming full teaching responsibilities, 
combined with working conditions which usually isolate teachers from their peers, provides 
new teachers with only limited opportunities to develop realistic standards for their 
performance (Moir, 1990). Not surprisingly, new teachers both need and usually appreciate 
someone who is willing to listen to their problems - both personal and professional - and 
offer supportive and useful feedback (Borko et al., 1986). 

Institutionally, new teachers face the problems of having to quickly become familiar 
with district and school policies, practices, and procedures; learning about resources and 
how to access them; and becoming integrated into the community of teachers in the school. 
Many new teachers experience difficulties and frustration in locating and absorbing this 
critical information (Grant and Zeichner, 1981; Odell, 1986). 






The research on new teachers also focuses on identifying stages at which different 
skills develop. The knowledge base of teaching is very complex, and the period of training is 
brief, especially compared to other professions which tend to provide for a more gradual 
assumption of professional responsibilities (Wise and Darling-Hammond, 1987). Preservice 
courses and experience, no matter how well structured, cannot fully prepare teacher 
candidates to perform as excellent practitioners in the classroom. The emerging literature 
on differences between new and experienced teachers suggests that some skills may be 
present in only rudimentary form in new teachers. Compared to new teachers, for example, 
experienced teachers are more likely to see lessons as composed of general pedagogical 
routines for specific purposes, such as introducing new concepts, applying concepts 
previously learned, reviewing content previously learned, collecting homework, etc. 
(Leinhardt, 1989). Expert teachers also see the subject matter organized in frameworks, 
while novice teachers see it as more of a collection of facts (Wilson, 1988; Leinhardt, 1989). 
Knowledge of students and student learning also seems to be a skill that develops with 
experience in teaching (Leinhardt, 1983; Wilson, 1988). 

Support and Assessment of New Teachers in California 

Becoming a teacher in California is much the same as in other states. An individual 
can qualify for a teaching position by earning a baccalaureate degree in any field, completing 
a one-year post-graduate program of teacher education, and passing standardized tests of 
basic skills and content knowledge. During the 1980s, the support systems for beginning 
teachers in California consisted largely of (1) cooperating classroom teachers who supervised 
candidates during student teaching, (2) mentor teachers who assisted new teachers and trained 
experienced colleagues, and (3) principals of the schools in which new teachers were hired. 
Similarly, the beginning teacher assessments consisted of standardized multiple-choice tests 
and the evaluation of performance during student teaching and probationary employment. 

In 1984 and 1985, the traditional systems of new teacher support and assessment 
were examined in considerable depth in California (Commission on Teacher Credentialing, 
1985; Commission on Teacher Quality, 1983; California Commission on the Teaching 
Profession, 1985). In 1987, the California Department of Education (CDE) and the 
Commission on Teacher Credentialing (CTC) cosponsored a series of policy seminars at 
Stanford University on "New Teachers For California: Issues of Support and Assessment." 
These and other analyses reached the following conclusions: (1) Student teaching and the 
professional education courses that accompany student teaching are important elements of 
teacher preparation, but they are insufficient for many new teachers to become skillful, 
proficient professionals; (2) Student teachers practice in environments that are considerably 
different from the settings in which they ultimately teach; (3) In addition to state reviews of 
teacher education programs, candidates need to demonstrate their individual readiness for 
teaching through a candidate-based assessment system; (4) Many of the complexities and 
nuances of effective teaching are learned during the teacher's initial classroom work; and, 
finally, (5) New approaches to teacher credentialing should include a model of licensure that 
takes into account new teachers' classroom pedagogy, subject matter knowledge, and ability 
to relate to students. 

Traditionally, the primary supervisors of new teachers have been site administrators. 
The growing demands on school principals have made it increasingly difficult, however, for 
these local educational leaders to attend to the needs of beginning teachers in timely, 
intensive ways. New teachers are most often employed in schools with large, crowded 
classrooms of students who are increasingly diverse in their linguistic, academic, and 
cultural backgrounds. Most principals do not have sufficient time (and in some cases, 
expertise) to provide high quality support for new teachers in these contexts. Experienced 
teachers, another logical source of support for new teachers, also lack sufficient time to 
provide intensive support. Moreover, research suggests that, to be effective supporters of 
new teachers, experienced teachers need not only time but also authority, compensation, and 
training. 

On the other side of the new teacher support coin is accountability. In 1984-85, the 
"reform commissions" concluded that the traditional assessments of beginning teachers were 
inadequate to the challenge of verifying the competence of each new professional. With 
little prior training, thousands of classroom teachers assessed the performances of student 
teachers on the basis of standards and criteria that were varied, unclear, and poorly related 
to the changing realities of California classrooms. Furthermore, knowing that the 
prospective teacher's career depended largely on the evaluation he or she received as a 
student teacher, virtually all supervisors awarded outstanding grades to the novices to whom
they also provided guidance and assistance. Evaluations made by school principals of their
new hires were also rarely negative, and, due to the large differences between school 
districts, the standards and procedures used for evaluations tended to be uneven and 
unreliable. 

California Teacher Credentialing Reforms 

Several reform initiatives undertaken by the CTC and the CDE since 1985 have been 
devoted to the successful resolution of issues related to the comprehensive support and 
assessment of beginning teachers. In concert with local teachers and administrators, the
CTC examined the tests that new teachers were required to pass for California teaching
credentials. The Commission found that the tests of teachers' content knowledge--the NTE
Core Battery and the NTE Specialty Area Tests--were not current with the changes in
California's reform curriculum. The multiple-choice format of the test questions could not
assess the thinking skills in which teachers need to engage their students when thinking
about science, mathematics, languages, history, social science and the arts (Wheeler et al.,
1988). The CTC is currently examining the performance characteristics of the California
Basic Educational Skills Test (CBEST). However, because the CBEST was designed to 
verify a minimum level of proficiency in basic academic skills that should be acquired during 
elementary and secondary schooling, it is not intended to be a test of teaching ability. The 
Commission recently initiated several studies to explain why this test continues to be 
difficult for disproportionate numbers of minority examinees. 

The Commission on Teacher Credentialing reappraised the support and evaluation of 
student teachers, which are now the subject of Standards of Program Quality and 
Effectiveness that the Commission adopted in 1986 and strengthened in 1988. The CTC 
created teams of teachers and teacher educators to review each teacher education program 
on the basis of these new standards, which require universities to establish documentary 
evidence of each teacher's performance in relation to ten uniform criteria of effectiveness. 

At the same time, the Commission established panels of other subject-matter experts 
to develop new examinations of the content knowledge of future teachers. New exams will 
include subject-matter performance exercises as well as multiple-choice questions, and they 
will replace the NTE Specialty Area Tests and the NTE Core Battery Test beginning in 
1991-92. To ensure that the new exams will be congruent with the Model Curriculum 
Standards, K-8 Guidelines, and State Curriculum Frameworks, the California Department of
Education has been an active partner in these changes in subject-matter examinations. 

Although these reforms promise to contribute to the effectiveness of the California 
teacher force in the future, they essentially leave intact the conditions in which beginning 
teachers work after completing their initial credential requirements. To address the state's 
induction of new teachers, including the proposals to establish support systems and 
assessment systems for first- and second-year teachers, the CTC and CDE are jointly 
administering the California New Teacher Project, which was authorized by policy 
legislation and budget appropriations beginning in 1988. 



The California New Teacher Project 



The California New Teacher Project (CNTP) was created by the legislature in the 
Teacher Credentialing Law of 1988 (Chapter 1355 of the Statutes of 1988). Charged with 
exploring innovative methods of new teacher support and assessment, the CNTP has three
components: support, evaluation, and assessment. A brief overview of each component and 
the overall goals of the CNTP are found in this section; the assessment component is 
described in more detail in the following section. 

The support component of the CNTP consists of local pilot projects representing 
diverse teaching contexts as well as a variety of approaches to supporting new teachers. 
During the first year (1988-89), fifteen projects funded through a combination of state and 
local sources participated. The number of projects was increased in the second year (1989-90)
to include additional projects either expanding the representation of approaches to new
teacher support or con - s programs of district-funded support and receiving funds only 
to participate in CNTP meetings and data collection efforts. Although these projects are not 
the only new teacher support programs in California, teachers and administrators in these 
projects are a key component of the research on alternative methods of new teacher support 
sponsored by the CNTP. 

The evaluation component of the CNTP is designed to investigate the effects of the 
various methods of support on new teacher effectiveness and retention, as well as cost- 
effectiveness. The variety of approaches to new teacher support combined with the 
evaluation of these approaches should help to identify the forms and intensity of assistance 
that are most effective with new teachers entering the profession. The CTC and SDE have 
contracted with the Southwest Regional Laboratory (SWRL) to conduct all activities in the
evaluation component. The evaluation results of the first two years of the CNTP can be 
found in two reports: 1988-89 Evaluation Report (SWRL, 1990) and 1989-90 Evaluation 
Report (SWRL, 1991). 

Assessment Component of the California New Teacher Project

Many of the reform advocates have criticized the exclusive use of multiple-choice 
tests in traditional teacher licensure systems. According to many teachers, teacher 
educators and researchers, multiple-choice questions cannot assess many of the important 
skills and abilities that characterize proficient, effective teachers. These advocates have 
recommended that states examine the efficacy of other methods for assessing the 
capabilities of credential candidates, methods such as on-site observations, oral interviews,
structured exercises in assessment centers, and the use of videotaped scenarios and other
"prompt materials" in performance assessments. Each of these recommendations was 
intended to make the assessment of teaching more authentic in relation to teachers' actual 
duties and requirements. When education policymakers in California faced the choice of 
assessment methods, however, they quickly discovered that few, if any, of the recommended 
methods had been pilot-tested or evaluated in practice. The literature on education reform 
was "long" on suggestions but "short" on evidence of the cost-effectiveness of varied methods 
of assessing teacher competence and performance. 

To help education policymakers with their choice of assessment methods, the
assessment component of the CNTP was designed to develop and pilot test innovative forms 
of new teacher assessment. The evaluation of diverse approaches to teacher assessment is 
intended to identify the most promising ways in which a comprehensive assessment of 
teacher candidates could inform the credentialing process and contribute to the quality of 
teaching. This document reports the analysis of the pilot tests of assessments that were 
completed during 1990, the second year of the CNTP. The analysis of the first year of pilot 
testing appears in a previous report, Assessment Component of the California New Teacher 
Project Year One Report. The pilot tests were administered and analyzed by Far West 
Laboratory for Educational Research and Development (FWL). The design and purpose of 
the second year of pilot testing are described in Chapter 2. The 1990 pilot tests differ from 
the 1989 pilot tests in that the 1990 assessments were specifically commissioned by the 
California New Teacher Project to increase the diversity of assessment approaches 
represented in the research and to better reflect California's curriculum and diversity of 
students. 

The Bergeson Act (S.B. 148), which created the CNTP, specifically requires that each
alternative method of support and assessment be evaluated along the following dimensions: 

• effectiveness at retaining capable beginning teachers in the profession; 

• effectiveness at improving the pedagogical content knowledge and skills of the 
beginning teachers who are retained; 

• effectiveness at improving the ability of beginning teachers to teach students who 
are ethnically, culturally, economically, academically, and linguistically diverse; 



• effectiveness at identifying beginning teachers who need additional 
assistance and, if that additional assistance fails, who should be 
removed from the educational profession; 

• the relative costs of the method in relation to its beneficial effects; and 

• the extent to which an alternative method of supporting or assessing beginning 
teachers would, if it were added to the other state requirements for teaching 
credentials, make careers in education more or less appealing to prospective 
teachers. 

Although both the support and assessment components are guided by relevant state 
curriculum frameworks and expectations for the pedagogical competence of new teachers, 
the SDE and CTC have not generated a list of competencies to serve as a common focus for 
all components of the CNTP. Instead, to increase the variety of methods being evaluated, 
the assessment component is conducted independently of the evaluation and support 
components. For this reason, the competencies being measured by the assessment 
instruments piloted may or may not coincide with the areas of support offered to the new 
teachers by the support projects. The integration of the lessons learned from the evaluation 
and assessment components will facilitate an analysis of the relationships and interactions 
among teacher preparation, support, assessment, and credentialing to suggest whether and 
how a program of support and assessment for new teachers should be developed. 

In examining current approaches to teacher assessment, CTC and SDE staff found 
few assessment approaches that are closely related to the tasks that teachers perform in the 
course of their work. This lack has led nationally to the development of alternatives to 
multiple-choice tests, which historically have been the dominant form of large-scale teacher 
assessments. The alternatives are often referred to as "innovative" or "performance-based"
assessments because of their emphasis on direct measurement of actual teacher 
performance. 

A variety of performance-based teacher assessments has been developed in recent 
years, including a number of observation instruments which have been adopted as teacher 
credentialing requirements in other states. However, many of these instruments are very 
prescriptive in terms of teaching style. Since California classrooms are extremely diverse, 
instruments which tend to promote only one or a few teaching styles are inappropriate for 
use in assessing California teachers. For this reason, the CNTP is designed to evaluate the
degree to which various assessment approaches measure the ability of teacher candidates to
teach a wide variety of students. 

The Bergeson Act reflects an emerging design for California's assessment of teacher
candidates in four areas: (1) basic academic skills; (2) subject matter knowledge; (3)
subject-specific pedagogy; and (4) general pedagogy. The CBEST has been judged to be suitable for
assessing candidate performance in the first area (Watkins, 1985), and revisions are under 
way in the second area in tests that measure subject matter knowledge of elementary 
teachers (NTE Core Battery) and secondary teachers (NTE Specialty Area Tests). The 
third and fourth areas, which were judged to be most effectively assessed after candidates 
have had some experience in conducting their own classrooms (i.e., in the first year or two 
of teaching), are the primary focus of the CNTP. The CNTP aims to identify promising, 
cost-effective assessments of subject-specific pedagogy and general pedagogy, especially in 
the following areas: Secondary English, Secondary Mathematics, Secondary Life Science, 
Secondary Physical Science, Secondary Social Science, and Elementary Teaching. 

Because of the high interest in teacher assessment among educators in recent years,
together with a growing recognition of the limitations of the multiple-choice approach, new 
assessment approaches are being developed, and old approaches are being revised. New 
approaches include the use of videotapes, written vignettes, structured interviews, 
structured simulations, and reviews of portfolios of a teacher's work. More traditional 
approaches such as classroom observation are being revised and refined so as to go beyond 
the checklist format and to move toward an instrument which provides rich information 
with strong diagnostic potential. 

In planning the research to be conducted in the assessment component of the CNTP, 
staff from the CTC and SDE considered both the high cost of assessment development and 
the desirability of evaluating a wide variety of assessment approaches. Many "innovative" 
assessment instruments are in the initial stages of development, and could only serve as 
initial prototypes for exploring the potential of an assessment approach, rather than as 
state-of-the-art instruments reflecting a long period of experimentation within that 
approach. The most promising state-of-the-art instruments representing assessment 
approaches in later stages of development were, for the most part, pilot tested during the 
first year of the CNTP. Therefore, to maximize the information to be gathered while 
minimizing developmental costs, the assessment instruments commissioned for pilot testing 
in the second year were not required to be fully developed products with well established
validity and reliability. Instead, the second year's pilot testing was designed to yield
information about the strengths and weaknesses of assessment approaches for which the
specific instruments serve as exemplars. The purpose of the pilot testing is not to consider
particular instruments for adoption, but to identify promising approaches to the assessment 
of teachers, to guide future selection and/or development of instruments which are tailored 
to the California context. Consistent with this purpose, assessment prototypes were piloted 
on a small scale with a thorough trouble-shooting process in order to learn as much as 
possible about the strengths and weaknesses of each approach before incurring the expense 
of large-scale field tests. 

Pilot Testing in Spring and Summer, 1990

The instruments pilot tested during 1990 were commissioned during the first year of 
the CNTP to represent subject matter areas or assessment approaches which had been 
insufficiently explored. These instruments and the approaches which they represent are as 
follows: 



Instrument                              Approach

Secondary Life/General                  Structured Simulation Tasks
Science Teacher Assessment

Laboratory Science Assessment           Subject-Matter Specific
                                        Classroom Observation

Assessment of Competence                Structured Simulation Tasks
in Monitoring Student
Achievement in the Classroom

Language Arts Pedagogical               Videotaped Teaching Episodes
Knowledge Assessment

Secondary English Assessment            Assessment Center
                                        Performance-Based Exercises



The evaluation of the various components (e.g., logistical requirements, prompt 
materials, scoring criteria, training of assessors and/or scorers) of the instruments was 
intended to provide information about the strengths and limitations of the assessment 
approaches which the specific instruments represented. The pilot tests were not expected to 
yield definitive measurements of the psychometric properties of the instruments because the
prototypes had not been sufficiently developed for that to occur. This focus on trouble-shooting
allows small-scale pilot testing, requires fewer resources, and considerably
increases the number of assessment approaches which can be examined. The goal of the 
pilot tests is to suggest whether or not it is advisable to invest additional resources in the 
development of assessments resembling those piloted. 

This document is the final report and analysis of the administration and scoring of 
the assessment instruments pilot tested in 1990. The next chapter describes the pilot test 
design and the processes used to evaluate the assessment approaches which were examined 
in 1990. In the chapters that follow, each of the assessment instruments is described, with 
each chapter including a discussion of the ease of administration, scoring, content and 
format, costs, and technical qualities of the instrument. The instruments are presented in 
the following order: the Secondary Life/General Science Teacher Assessment, the 
Laboratory Science Assessment, the Assessment of Competence in Monitoring Student 
Achievement in the Classroom, the Language Arts Pedagogical Knowledge Assessment, and 
the Secondary English Assessment. The report concludes with a summary of strengths and 
weaknesses of the assessment approaches represented by these instruments, conclusions 
about the effective design of training for assessors and/or scorers, and an identification or 
augmentation of policy issues beyond those discussed in the first year report which will 
affect the design of a teacher assessment system. 






CHAPTER 2: 
PILOT TEST DESIGN AND ANALYSIS 






This chapter describes the design and analysis of the pilot tests of prototypes 
representing various assessment approaches. Different sections describe the source of
instrumentation, the sampling plans, sources of information for evaluating the instruments 
and the assessment approaches, methods of data reduction and major categories of analysis. 
Deviations from the design due to unanticipated events will be described in following 
chapters which focus on the individual instruments. 

Design of Pilot Tests

This section on the design of the pilot tests describes the sources of instrumentation 
and the sampling plans. Procedures for data collection and analysis will be described in the
sections on data collection and data reduction. 

Sources of Instrumentation 

In the first year of the project, the prototype instruments that were pilot tested were 
selected on the basis of their representation of state-of-the-art development of innovative
assessment approaches. For this second year of pilot testing, the Interagency Task Force 
commissioned the development of additional prototypes through a competitive bidding 
process. It was intended that these new prototypes would be more congruent with the 
California Model Curriculum Guides than those pilot tested in the first year of the project 
which were developed for other states or a national audience. The new prototypes 
commissioned were also chosen to represent a variety of assessment approaches. Each will 
be described separately. 

The Secondary Life/General Science Teacher Assessment is a set of structured
simulation tasks to which teachers respond in writing. The tasks are chosen to represent
important responsibilities which differentiate more and less competent beginning teachers.
This assessment was developed by the Rand Corporation in Santa Monica, using the same
process that was used to develop performance tasks for the Bar examination to license
lawyers. The current set of tasks does not represent a complete assessment, but rather
prototype tasks that may eventually be incorporated into a complete assessment.

The Science Laboratory Assessment, developed by the RMC Corporation in Mountain
View, California, combines classroom observation with structured interviews to measure
both general pedagogical skills and instructional skills in a science laboratory setting.

The Assessment of Competence in Monitoring Student Achievement in the Classroom 
uses a set of classroom-based scenarios ending with a question or a short series of 
questions to which teachers respond in writing. It was developed by the Northwest 
Regional Educational Laboratory (NWREL) in Portland, Oregon, based on a decade of 
research and development of training on the topic of classroom assessment. 

The Language Arts Pedagogical Knowledge Assessment (LAPKA) was developed by 
a second team from NWREL. It uses videotapes of teachers instructing small groups of 
their students to portray a variety of approaches to language arts instruction. The 
videotapes are stopped at various points to pose questions to which the teachers respond in 
writing. 

The Secondary English Assessment, developed by San Francisco State University, 
consists of two parts: a set of three assessment center exercises, and a portfolio 
documenting teaching of a unit of instruction. This report focuses only on the assessment 
center exercises. The portfolio phase is in progress as this report is being written and will 
be analyzed in the Year Three Report. 

The assessment developers provided guidelines for administration and, except for the 
Secondary English Assessment, supervised training of scorers and observers. Veteran 
English educators who had participated as scorers in an earlier administration of the 
assessment served as trainers of the administrators/scorers for the assessment center phase
of the Secondary English Assessment. All observers and scorers were recruited by FWL
staff; some of these had participated in the development of the assessments. 

Sampling Plans 

Our goal was to obtain a broad sample of teachers representing both genders as well 
as a variety of ethnicities and teaching contexts. In addition, we desired participation from
teachers representing the range of grade levels included in the credential that was the focus
of the assessment. 

In the case of the Assessment of Competence in Monitoring Student Achievement in 
the Classroom, two districts were recruited to provide groups of elementary teachers to 
participate in the assessment. For the other assessments, recruitment of individual teachers
was necessary. We began the sample selection process by assembling lists of possible
participants within each project in the California New Teacher Project (CNTP). For the 
science assessments, it was apparent that teachers outside the CNTP would need to be 
contacted in order to reach the desired sample size. In these cases, personnel offices of 
virtually every school district in the greater Bay Area and Los Angeles area were contacted 
to obtain lists of teachers. Once these lists were completed, the characteristics of grade 
level, school context (e.g., inner city, suburban, rural), gender and ethnicity were considered 
in selecting teachers to contact for possible participation in an assessment. 

Other factors influenced the choice of teachers for various assessments. For the 
Secondary Life/General Science Teacher Assessment and the Language Arts Pedagogical 
Knowledge Assessment, it was important that teachers be located reasonably close to a 
central assessment site. Teachers were individually observed for the Science Laboratory 
Assessment, so teacher selection entailed balancing the goals of minimizing travel costs, 
matching teacher and observer availability, and obtaining a broad sample. The Secondary 
English Assessment was administered at a single site during one week in the summer. 
Since statewide representation in all assessments was desired, the budget provided for half 
the teachers to travel to the assessment by air and half by local transportation. This 
budgetary constraint guided sample selection for the Secondary English Assessment. 

Although we wanted to maximize variation in the characteristics of teachers selected, 
our ability to do so was limited by the information which we had about project teachers, the 
time required to recruit nonproject teachers, and the small samples. Information on the 
ethnicity of teachers was available for many of the projects, but there were few nonwhite 
teachers, precluding the selection of a significantly large subsample. Our information on 
school context was limited to our knowledge of the districts participating in the various 
projects, augmented by conversations with the Project Directors and teachers. 

Even though obtaining a broad sample of teachers was a goal, this was not possible 
for all assessments. Considerations of administrative costs and geographic dispersion of 
teachers led to an underrepresentation of rural teachers in most assessments. The 
recruitment of minority teachers was a priority, but locating minority teachers proved to be
difficult. The number of minority teachers participating in the assessments ranged from
three to fifteen. The characteristics of teachers in the samples are described in more detail 
in the chapters that focus on specific instruments. 

This section describes our procedures for data collection and reduction, as well as the 
key analytic categories focusing on specific aspects of instruments. The data collected also 
served as a basis for judging the potential of the assessment approach which the particular 
instrument utilized. 

Data Collection 

Since the same means of data collection were used for all assessment instruments, 
they will be discussed together. Several sources of data were used: 

• evaluation feedback forms completed by teachers who participated in the pilot 
tests; 

• evaluation feedback forms completed by the observers and scorers; 

• observations of the administration of each assessment and the training of 
observers and scorers recorded in field notes by FWL staff; 

• scores that reflected the performances of participating teachers on the 
assessment instruments; 

• review of instruments or portions of instruments by an expert on teaching 
diverse students; and 

• the most recent relevant Curriculum Guide(s) and/or Framework(s) and the 
California Standards for Beginning Teachers. 

Using the list of analytic categories and the evaluation feedback forms developed 
during the first year of the project, FWL staff developed separate forms for each group 
which were tailored to specific assessment instruments. These forms were given to teachers 
upon the completion of each assessment, except in the case of the classroom observation 
instrument, where they were mailed. Observers and scorers returned completed forms with 
their invoices for payment. Since the emphasis in the pilot tests was on trouble-shooting,
the evaluation feedback forms focused on critical evaluations of the instruments with
respect to the analytic categories described in the next section. Most of the questions
required yes/no or fixed response answers with spaces provided to elaborate. 

Field notes were taken during observations of the assessment administrations. FWL 
staff conducted most administrations of the assessment instruments, and accompanied one 
observer during the use of the observation instrument. FWL staff also observed the training 
of observers and scorers. For the Assessment of Competence in Monitoring Student
Achievement in the Classroom, FWL staff also served as participant observers for scoring to 
obtain a more complete understanding of the performance of the assessment instruments. 

The content of each prototype was compared to all of the relevant California Model 
Curriculum Guides and Frameworks, and with the California Standards for Beginning 
Teachers. The Model Curriculum Guides and Frameworks are recent documents produced 
by subject matter panels convened by the California State Department of Education. 
Reflecting a consensus among panel members on the content and philosophy of instruction, 
these documents are expected to guide curriculum development and instruction in the 
subject in California public schools. If there were two or more Guides or Frameworks 
addressing a particular subject area, the most recent one was used. 

The California Beginning Teacher Standards are standards that define the level of 
pedagogical competence and performance that the Commission on Teacher Credentialing 
expects the graduates of credential programs to attain as a condition for program approval. 
These standards--Standards 22 through 32--are listed in Standards of Program Quality and
Effectiveness, Factors to Consider and Preconditions in the Evaluation of Professional 
Teacher Preparation Programs for Multiple and Single Subject Credentials. (Other 
standards address more general program requirements; these focus specifically on candidate 
competencies.) Although these are standards for teacher preparation programs and not 
teacher candidates, they identify the knowledge and skills that beginning California teachers 
are expected to attain. 

Data Reduction 

Data reduction techniques varied with the data collection method. Fixed-response 
questions on the evaluation feedback forms completed by teachers participating in the pilot 
tests, observers, and scorers were tabulated. Open-ended responses and elaborations were
compiled. Responses which either stated a common viewpoint well, or which provided an 
additional perspective, were highlighted for possible quotation in the reports. For the 
fixed-response questions where elaboration was invited, the focus was on identifying
weaknesses in the instruments and on soliciting suggestions for improvement. Therefore,
teachers were only asked to comment on negative responses, so there were many more 
negative evaluations available for quotation than positive ones. 
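
The tabulation described above can be sketched in a few lines of Python. This is only an illustration: the question name `clear_directions`, the `_comment` suffix, and the sample answers are hypothetical and are not taken from the actual evaluation feedback forms.

```python
from collections import Counter

def tabulate_responses(forms, question):
    """Count each fixed-response answer to one question, skipping blanks."""
    answers = [form.get(question) for form in forms]
    return Counter(a for a in answers if a is not None)

def negative_comments(forms, question, negative="no"):
    """Collect the elaborations that accompany negative answers only."""
    key = question + "_comment"  # hypothetical naming convention
    return [form[key] for form in forms
            if form.get(question) == negative and form.get(key)]

# Hypothetical feedback forms, one dictionary per respondent.
forms = [
    {"clear_directions": "yes"},
    {"clear_directions": "no",
     "clear_directions_comment": "Task 2 wording was ambiguous."},
    {"clear_directions": "yes"},
    {"clear_directions": None},  # question left blank
]

counts = tabulate_responses(forms, "clear_directions")
comments = negative_comments(forms, "clear_directions")
print(dict(counts))
print(comments)
```

Because comments are gathered only for negative answers, the compiled elaborations will be weighted toward criticism, which mirrors the imbalance noted in the text.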

Field notes were reviewed for relevant information that addresses the analytic 
categories and were incorporated into the chapters about specific instruments. 

When numbers were large enough to permit analysis of scores by subgroup, the 
following comparisons were made: male/female, minority/nonminority, and teachers at
different grade levels and in different locations (urban, rural, inner city). 
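
A minimal sketch of such a subgroup comparison follows. The report does not state how large a subgroup had to be before scores were compared, so the `min_n` threshold, the field names, and the sample records are all assumptions made for illustration.

```python
from statistics import mean

def subgroup_means(records, group_key, score_key="score", min_n=5):
    """Return the mean score per subgroup, or None if any subgroup is too small.

    min_n is a hypothetical cutoff; the report gives no specific value.
    """
    groups = {}
    for record in records:
        groups.setdefault(record[group_key], []).append(record[score_key])
    if any(len(scores) < min_n for scores in groups.values()):
        return None  # numbers too small to permit a comparison
    return {group: mean(scores) for group, scores in groups.items()}

# Hypothetical teacher records.
records = [
    {"gender": "female", "score": 3},
    {"gender": "female", "score": 5},
    {"gender": "male", "score": 2},
    {"gender": "male", "score": 3},
]

means_small_cutoff = subgroup_means(records, "gender", min_n=2)
means_large_cutoff = subgroup_means(records, "gender", min_n=3)
print(means_small_cutoff, means_large_cutoff)
```

Returning `None` rather than a mean for undersized groups keeps a comparison from being reported when the sample cannot support it.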

At least a portion of each assessment was scored by two people to assess inter-rater 
reliabilities. Scores were also used to estimate the internal consistency of an instrument. 
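
The report does not say at this point which statistics were used for these estimates. As a hedged illustration, the sketch below computes three common choices: percent exact agreement and Cohen's kappa for two scorers, and Cronbach's alpha for internal consistency. The function names and the sample scores are hypothetical.

```python
from collections import Counter
from statistics import variance

def percent_agreement(rater_a, rater_b):
    """Share of responses that both scorers rated identically."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Agreement between two scorers, corrected for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each scorer's marginal proportions.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both scorers use one category")
    return (observed - expected) / (1 - expected)

def cronbach_alpha(score_rows):
    """Internal consistency; each row holds one teacher's scores on every task."""
    k = len(score_rows[0])
    item_vars = [variance(column) for column in zip(*score_rows)]
    total_var = variance([sum(row) for row in score_rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical double-scored responses on a 1-3 scale.
rater_a = [1, 2, 3, 3, 2, 1, 2, 3]
rater_b = [1, 2, 3, 2, 2, 1, 3, 3]
agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)

# Hypothetical scores: three teachers, two tasks each.
alpha = cronbach_alpha([[1, 1], [2, 2], [3, 3]])
print(agreement, round(kappa, 3), alpha)
```

Percent agreement is easy to interpret but ignores agreement expected by chance, which is why a chance-corrected index such as kappa is often reported alongside it.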

The Model Curriculum Guides and Frameworks were examined by FWL staff. Their 
professional judgments were used to draw conclusions about each assessment instrument's 
extent of coverage and congruence with the relevant Guide or Framework. The reasoning 
underlying these judgments is described in detail in the chapters on the specific prototypes. 

Overview of Analytic Categories 

The same general analytic categories were used to appraise all assessment instruments.
They included: administration, content, format, cost analysis, and technical
quality. These categories and their subcategories are discussed below. 

Administration of Assessment 

This category included consideration of the logistics, security needs, and training of 
observers and scorers for the particular assessment instrument. Generally, this category 
generated information required to estimate administrative requirements and cost 
projections. The logistics required for administration predict the ease of administration if
the assessment approach were to be implemented on a statewide basis. The more 
complicated the logistical requirements, the more expensive the assessment is to administer. 
Security needs impact not only logistical requirements, but also the frequency with which 
the instrument must be revised for statewide administration. Consideration of the training 
of observers and scorers suggests the degree of difficulty to be anticipated in recruiting 
people with the required professional expertise, and the time required to prepare personnel 
to administer and score the particular assessment instrument. 


Assessment Content 



This category addressed the specific instrument's congruence with the relevant 
Curriculum Guide or Framework, and the extent to which the California Standards for
Beginning Teachers were covered. It also included an examination of the content of the 
assessments along the following dimensions: job-relatedness, appropriateness for beginning 
teachers, appropriateness across varying teaching contexts, fairness across different groups 
of teachers, and general appropriateness of the assessment approach represented by the 
prototype as a method of assessing teachers. Comparison of the assessment content with 
the relevant Curriculum Guide and the California Standards for Beginning Teachers was 
necessary to determine whether the assessment approach was compatible with the 
instructional philosophy underlying the various California curricula and the competencies 
specified for teacher candidates. Since one common criticism of teacher assessment 
instruments is that scores have not been shown to be closely related to specific teaching 
competencies, job relevance was included as an analytic category. The more closely the 
assessment tasks resemble the activities that teachers do in the course of their teaching 
duties, the higher the potential relationship of scores to actual teaching competencies. 

Since the CNTP focuses on the assessment of teachers early in their teaching career, 
it is important to judge the appropriateness of each assessment in terms of performance 
expectations and perceived difficulty for teachers at this stage of career development. 
Appropriateness across contexts is particularly important for California, since it has a wide 
diversity in student populations. The issue of fairness across groups of teachers relates to 
the potential for bias with regard to any particular group of teachers (e.g., gender, 
ethnicity). 

Assessment Format 

This category included the general clarity of orientation materials, directions for 
completing the assessment, and scoring criteria. In order for the performance of candidates 
to reflect their true competencies, it is essential that all candidates have a clear and accurate
understanding of the performance that is expected of them. This is not possible when
teachers are uncertain as to what they are being asked to do. It is equally important that 
scorers have a clear understanding of the criteria by which they are judging a teacher's 
performance. 




Cost Analysis

Based on the pilot testing experience, we attempted to project the costs of a
statewide administration and scoring of an instrument which resembled the prototype 
tested. We also have reported the costs for the developers to develop these prototypes and 
for FWL's pilot testing. The developmental costs experienced to date provide a rough basis 
for judging the developmental costs for assessment approaches like these. 

Technical Quality 

This category discussed the work performed to date in the development of the 
prototype, together with data estimating the reliability and validity of the instrument.

This chapter has outlined the general design for the 1990 Spring and Summer pilot
tests in the assessment portion of the California New Teacher Project. The following five
chapters discuss each of the assessment approaches pilot tested: a set of structured
simulation tasks (the Secondary Life/General Science Teacher Assessment), a classroom
observation assessment accompanied by brief structured interviews (the Science Laboratory
Assessment), a set of written classroom-based scenarios (the Assessment of Competence in
Monitoring Student Achievement in the Classroom), a set of edited videotapes portraying
instruction requiring written responses to questions (the Language Arts Pedagogical 
Knowledge Assessment), and a set of assessment center exercises (the Secondary English 
Assessment). (The Secondary English Assessment has an additional portfolio component, 
which is currently being implemented. The portfolio will be analyzed in the third year 
report.) 




CHAPTER 3:

SECONDARY LIFE/GENERAL SCIENCE TEACHER ASSESSMENT 




The secondary life/general science teacher assessment, developed by the Rand 
Corporation, is a set of structured simulation problems to which a teacher responds in 
writing. A complete assessment was not developed; development work focussed on the 
construction of prototypic tasks, each of which may eventually be combined with other tasks 
to form a complete assessment. (To simplify references to these prototypic tasks, however,
they will be referred to collectively as "the assessment.") For this pilot test, all stimulus 
materials were in written form, although the assessment developer sees videotape as a 
possible alternative stimulus. 

To facilitate future development of parallel tasks, the construction of each task 
begins with the design of a blueprint for production, which the developer terms a "shell." 
No two shells have exactly the same features and components. However, most shells
provide the following: 

• a general description of the activity or types of activities that will be 
present in a task, (e.g., "grade a set of student papers that exhibit at 
least five of the following characteristics...") and the general directions 
to candidates; 

• things that can be built into a task that candidates should attend to in 
specified ways (and which can be scored with respect to whether the 
candidate did or did not attend to them, e.g., one answer is symptomatic
of a common learning difficulty or disorder); 

• the types of materials candidates will receive (both in advance of the 
test and at the test site); and 

• any special features of the context that need to be explained.

Many different items or case situations can be generated from the same shell.



For this pilot test, the following five tasks were fully developed, from stimulus 
materials through scoring criteria: 

(1) Applying Effective Instructional Techniques. A teacher reads a simulated 
transcript containing several lesson segments from a single class, and identifies 
appropriate and inappropriate actions and statements made by the teacher in 
the script, commenting on why each is appropriate or inappropriate.

(2) Teacher as Curriculum Decision-Maker. This task consists of two parts. In the 
first part, the teacher combines a subset of given activities (e.g., lectures, 
laboratories, films, tests) into a two-week (ten day) lesson plan to achieve a given 
set of student objectives for the classroom described. In the second part, the 
teacher also provides a rationale for the overall plan. 

(3) Parent/Student Letter. This task also consists of two parts which relate to
drafting a letter regarding a science course, for which descriptions of the course 
and the students are provided. The letter is to be sent to parents and students 
at the beginning of the school year. In the first part, the teacher lists reasons 
why the course would be important and of value to students. In the second part, 
the teacher develops an outline of additional topics, including any required by 
law, to be included in the letter. 

(4) Lesson Planning. The two parts of this task focus on a specific lesson in a unit. 
A description of students, the instructional goals of the unit, and the other 
lesson topics in order of presentation are provided. In the first part, the teacher 
analyzes the strengths and weaknesses of three alternative lessons designed to 
fill the missing slot in the sequence of lessons in the unit. In the second part, 
the teacher designs a more effective lesson and describes its strengths and 
weaknesses. The teacher is free to modify one of the lessons provided or to 
design a new one. 

(5) Classroom and Facility Safety. This task consists of three parts. In the first 
part, the teacher provides a list of categories of activities (excluding facilities) 
that teachers can do alone or with their students at the beginning of the year to 
promote classroom laboratory safety. In the second part, the teacher lists
specific actions that would promote safety and prevent or reduce the likelihood
of accidents in a specific laboratory activity with the classes of students
described. In the final part, the teacher identifies safety hazards in a
drawing of part of the classroom.

In addition, six other tasks were partially developed, ranging from the shell stage to 
a revised draft of the shell, stimulus materials, and scoring criteria. These other tasks 
addressed the following topics: the transition to the laboratory, common scientific 
misconceptions, understanding student behavior, using computers as tools, evaluating 
student performance, and meeting special needs. However, only the five tasks described 
above were pilot tested. 

Two forms of Applying Effective Instructional Techniques were pilot tested. Each
form contained four lesson segments, and six lesson segments were developed. Two 
segments were common to both forms, and each form contained two of the remaining four 
segments. 

Administration of Assessment 

The administration of the assessment, the assessment content, and the assessment 
format are discussed below. The discussion of the Secondary Life/General Science 
Assessment concludes with a summary of our evaluations of its potential as a prototype for 
further assessment development. 

Overview 

The Secondary Life/General Science Assessment was administered at five sites in the 
Bay Area and the greater Los Angeles area between June 2 and June 23, 1990. As seen in 
Table 3.1, a total of 65 teachers participated, the majority of whom were female. The
teachers included sixteen minority teachers. A little over half of the teachers taught in
either a middle school or a junior high school; two additional teachers had teaching
assignments split between junior high and high schools. Approximately three-quarters of
the teachers graduated from traditional teacher preparation programs. Nearly all the
remaining teachers participated in intern programs, where they received their pedagogical
training while assuming sole responsibility for their classes of students. The two teachers
whose training fell into the "other" category received teacher training through the Peace 
Corps. 




TABLE 3.1

PILOT TEST PARTICIPANTS
SECONDARY LIFE/GENERAL SCIENCE TEACHER ASSESSMENT
(Number of Teachers = 65)

Distribution of Participants        Form A    Form B

Total Participants                    32        33

Gender
  Male                                11        14
  Female                              21        19

Ethnicity
  Asian                                2         4
  Black                                1         1
  Hispanic                             5         0
  Native American                      0         2
  White                               24        24
  Other                                0         1
  No Response                          0         1

Grade Level
  Middle/Junior High School           18        19
  High School                         12        14
  Both Junior and High School          2         0

Source of Teacher Preparation
  Intern Program                       7         7
  Regular Credential Program          24        24
  Other                                0         2
  No Response                          1         0






With the exception of the intern teachers, participating teachers were in their first or
second year of teaching. The intern teachers were either in their second and final year of 
training or in their first year of teaching following completion of the program. 

Two different forms of the assessment, with two segments of one task in common, 
were piloted. Thirty-two teachers completed Applying Effective Instructional Techniques 
(Form A), Teacher as Curriculum Decision-Maker, and Parent/Student Letter. Thirty-three 
teachers completed Applying Effective Instructional Techniques (Form B), Lesson Planning 
and Classroom and Facility Safety. The tasks were administered in the order listed. 
Teachers were given an hour to complete the first task, ninety minutes to complete the 
second, and forty-five minutes to complete the third. 

Logistics 

Administration required the following logistical activities: identifying a sample of 
teachers, sending orientation materials to teachers, administering the assessment, and 
acquiring evaluation feedback from the teachers. 

Identifying teacher samples. The California New Teacher Project contained too few 
science teachers to provide a sample for the assessment. Therefore, unlike the other pilot 
tests, most participants in this assessment were non-project teachers. We focussed on the
state's two largest urban areas, the greater Los Angeles area and the Bay Area, to locate a
sufficient number of first- and second-year life science teachers. The personnel office of 
most school districts in these two geographic areas was contacted and asked to either supply 
the names and school sites of any appropriate teachers or, if their policy prevented the 
release of names, to forward a letter to the appropriate teachers inviting them to participate 
in a pilot test. When the names of teachers were obtained, a letter was sent to them 
followed by a telephone message left at their school site inviting them to call collect for 
more information. Many of the teachers identified turned out to be in their first or second 
year in the district, but had more than two years' experience. The majority of the bona fide
first- and second-year teachers contacted agreed to participate in the assessment. More
teachers than needed were scheduled to participate to allow for some attrition. 

Sending orientation materials. The assessment developer provided the orientation
material for the teachers, which consisted of brief descriptions of six possible tasks which 
they would be asked to do (including one which was not pilot tested) and a list of 27 
possible science topics which might serve as the focus for the tasks. In addition, teachers 
received a letter briefly describing the California New Teacher Project and its Assessment
Component, and directions to the assessment site. Teachers were paid $80 for
participating in the assessment and completing an evaluation form. 

The assessment was designed to be administered to large groups by a test 
administrator who distributed and collected materials, announced the start and end of each 
task, and monitored the teachers to prevent cheating. No special training or background in 
science was needed, as the instructions were designed to be self-evident.

The only requirement which differed from those of traditional group-administered 
tests was that of sufficient surface area (e.g., individual desks or a number of tables) to 
spread out a number of materials. Facilities which fit this requirement proved to be easy to 
locate, and included classrooms used by a district for professional development, a large 
conference room, and a room in a medical center set up for classroom instruction. 

Each assessment began with a ten- to fifteen-minute overview of the research design
underlying the California New Teacher Project. Teachers were given the option of a five- to
fifteen-minute break between tasks, but usually opted to limit the break to five minutes to
finish earlier. In the overview, which was similar for all pilot tests (except the classroom 
observation assessment), the following topics were covered: (1) the purpose of the pilot 
testing and descriptions of the spring pilot test activities; (2) identification of the 
assessment developer and distinctions between the roles of the assessment developer and 
FWL; (3) the confidentiality and use of the results; and (4) a description of the evaluation 
form which teachers would complete at the end of the assessment. 

Test materials were distributed in three manila envelopes, with each envelope 
containing a single task. The envelopes were labeled with both the task code and an ID 
number. Teachers were instructed to record that ID number on the test materials and the 
evaluation form. 

Security 

It is the position of the test developer that once the test is given, its security is 
compromised, and new forms of the tasks must be developed. Therefore, security 
precautions coupled with the fiscal need to reduce development costs dictate that it be
administered to large groups in various locations on the same date. Facilitating the 
development of parallel tasks to maintain both security and fairness led the developer to 
conceive of the "shell" system for generating tasks. 




The tasks would almost certainly be memorable. Some are more amenable than 
others to coaching through memorization of acceptable answers, e.g., Part II of the 
Parent/Student Letter, where the teacher lists topics other than course content to be
covered in the letter. This list would be similar regardless of the course content described.
However, learning test-taking techniques and common answers would be less useful for a
task such as Teacher as Curriculum Decision-Maker or parts of Classroom and Facility
Safety, where answers depend on the content, and the teacher needs to be able to apply
general principles in light of the specific content portrayed. 

Assessors and Their Training 

Two members of the FWL staff administered the assessment. No training was 
provided other than instructions about times for the tasks and suggested breaks. Before the 
first assessment, the two staff members designed the assessment schedule, including time 
for the overview and the evaluation form. No need for further training in test 
administration was detected by the staff, who were experienced in conducting assessments, 
although standardized guidelines for dealing with possible situations, e.g., a test-taker 
becoming ill during the test, would be needed for statewide administration. 

Teacher and Assessor Impressions of Administration 

Teachers responded favorably when asked their impressions of the arrangements for 
administration, including scheduling, room arrangements, and distance to travel to the 
assessment site. Fifty-seven of the sixty-five teachers (88%) responded that the 
arrangements were reasonable. Comments critical of the arrangements addressed travel 
distance, early morning traffic coinciding with the time at which the test was scheduled, and 
security at one site where a teacher discovered vandalism to his car which was assumed to 
have occurred during the testing. 

Scoring 

The discussion of scoring addresses the scoring process and the scorers and their
training.

Scoring Process 

The scoring guide is built into the tasks during development. For example, in Lesson 
Planning, scorers do not grade a candidate's ability to distinguish between appropriate and
inappropriate plans. Neither are teachers asked to list the rules for good plans. Instead,
teachers evaluate actual plans, and scorers determine whether the teacher responded 
appropriately to a specific situation in which the ability to evaluate plans was needed. The 
scoring process differs slightly among tasks, but is generally based on correct identification 
of appropriate or inappropriate items built into the stimulus materials. Points are deducted 
for responses which are clearly wrong. For several teachers, this resulted in a negative 
score on one or more parts of a task. Scoring guides are modified after the task is 
administered, such as when the examinees see certain strengths or weaknesses that were 
not anticipated by task developers. The scoring system for each task will be discussed 
separately in more detail. 

The task Applying Effective Instructional Techniques consisted of four simulated 
segments of a single class. Each segment included both a transcript of teacher/student
conversations and, when needed to interpret the transcript, a description of what the 
teacher or the students were doing. Teachers were asked to identify both appropriate and 
inappropriate actions by the teacher and to briefly comment on them. 

Certain appropriate or inappropriate actions were built into the script when it was 
constructed, such as building upon previous instruction or reprimanding one student and 
not another for similar behavior. Scorers were presented with a list of these appropriate 
and inappropriate actions built into the script. A few additions to the list based on teacher
responses were made during the initial training to score the task. In the case of any teacher 
responses not already covered by the list, scorers were instructed to base their judgement on 
the previously identified examples. Teachers received one point each for every appropriate 
or inappropriate action they correctly identified. If a teacher identified an action 
incorrectly, i.e., said it was appropriate when it clearly was not or vice versa, one point was 
deducted. Some teacher comments were labeled "neutral" during the training, as when the 
comment was judged to be too vague or when the teacher went beyond the script in making 
assumptions about the teacher behavior. These "neutral" comments received a "zero" score. 
However, when teacher assumptions clearly contradicted the information provided in the 
script and accompanying materials, a point was deducted. 
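The point rules described above for Applying Effective Instructional Techniques can be illustrated with a brief sketch. It assumes that a scorer has already judged each teacher comment; the names Judgment and score_segment_comments are hypothetical and do not come from the developer's scoring guide.

```python
from enum import Enum


class Judgment(Enum):
    """A scorer's judgment of one teacher comment (hypothetical labels)."""
    CORRECT = 1     # action correctly identified as appropriate or inappropriate
    NEUTRAL = 0     # comment too vague, or assumption goes beyond the script
    INCORRECT = -1  # action misidentified, or assumption contradicts the script


def score_segment_comments(judgments):
    """Sum the points for a list of judged comments.

    No floor is applied, so the total may be negative, as the report
    notes happened for several teachers.
    """
    return sum(j.value for j in judgments)


# Illustrative data only: two correct comments, one neutral, one incorrect.
example = [Judgment.CORRECT, Judgment.CORRECT, Judgment.NEUTRAL, Judgment.INCORRECT]
total = score_segment_comments(example)
```

Because incorrect identifications subtract a point, a teacher who makes more incorrect than correct comments receives a negative total under this sketch.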

The two parts of Teacher as Curriculum Decision-Maker were scored differently. 
The first part consisted of choosing a subset of activities provided and arranging them into 
a two week unit of instruction given specified unit objectives and a description of the group 
of students in the classroom. Activities were divided into the following categories: 
Lecture/Discussion, Demonstration, Reading, Laboratory Activity, Film/Video, Student 
Worksheet, Homework-in-Class, and Testing/Evaluation. This part was scored using an
algorithm which awarded a teacher 100 points and then deducted varying points for the
following: (1) incorrectly sequencing antecedent and subsequent activities when both were
included in the unit; (2) omitting necessary prerequisites for activities included; (3) failing 
to cover one or more unit objectives (which were provided); (4) including topics which were 
tangential to the unit; (5) including activities which were too difficult for the class 
described; (6) failing to assemble enough activities, including homework in class, to cover a 
50-60 minute class period (times were provided for each activity except homework-in-class);
(7) failing to include a variety of activities over the entire unit (i.e., not having lessons 
predominantly composed of activities from a single category); (8) failing to have a variety of 
activities each day; (9) using too much instructional time for homework in class; (10) 
assigning too much weekly homework; and (11) giving too many tests within the two-week 
period. When a large number of points were available for deduction for any of the above, a 
ceiling on the number of points deducted was established. For example, although 35 points 
were possible for deduction for incorrectly sequencing activities, a maximum of 10 were 
deducted. This rule had to be applied in several cases. 
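The deduction algorithm with ceilings can be sketched as follows. Only the starting value of 100 points and the sequencing example (35 points possible, a maximum of 10 deducted) come from the description above. The function name, the category labels, and the other point value are hypothetical, since the report does not list the points assigned to each of the eleven deduction categories.

```python
STARTING_POINTS = 100  # each teacher begins with 100 points


def score_unit_plan(raw_deductions, ceilings):
    """Subtract capped deductions from the starting points.

    raw_deductions: category name -> points the plan would lose before capping
    ceilings:       category name -> maximum points deductible for that category
                    (categories without an entry are not capped)
    """
    total_deducted = 0
    for category, points in raw_deductions.items():
        cap = ceilings.get(category)
        total_deducted += points if cap is None else min(points, cap)
    return STARTING_POINTS - total_deducted


# The sequencing ceiling of 10 is from the report; "too_many_tests": 5 is
# an invented value used only to show an uncapped category.
ceilings = {"sequencing": 10}
raw = {"sequencing": 35, "too_many_tests": 5}
score = score_unit_plan(raw, ceilings)  # 100 - min(35, 10) - 5
```

A ceiling of this kind keeps a single type of error, repeated many times, from dominating the score for the whole unit plan.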

In the second part of the Teacher as Curriculum Decision-Maker task, the teacher 
provided a rationale for the activities chosen for the unit. A set of possible rationales was 
devised prior to the scoring training by the task developers; additions were identified during 
the training. A teacher received from 0 to 3 points per rationale mentioned, depending on 
whether it was appropriate and, if appropriate, on the depth with which it was explained. 
Clearly inappropriate rationales received a deduction of one point. The Parent/Student 
Letter task was scored similarly, with the first part covering reasons for taking the course 
scored with 0-2 points per reason, and the second listing additional topics to be covered in
the letter scored with 0-3 points per topic. 

The Lesson Planning task had two parts. In the first part, the teacher listed 
strengths and weaknesses of three alternative lessons which filled a gap in a specific unit of 
lessons and, together with the other lessons, addressed a set of unit objectives. The unit 
plan, except for the missing lesson, and the unit objectives were provided. Teachers 
received one point for each distinct but appropriate strength or weakness; one point was 
deducted for inappropriate responses. Scorers worked from a previously established list of 
strengths and weaknesses, but were free to award points if, in their professional judgement, 
they believed that the candidate response, though not on the list, was valid. 

In the second part of Lesson Planning, the teacher provided an alternative design for
the missing lesson and described its strengths and weaknesses. The description was to
include student performance objectives, key concepts to be taught, the sequence of classroom
events with anticipated times of completion for each activity, and homework. The teacher
responses were scored for both organization and content. Each was scored on a four-point
scale, with 0 points for responses judged to be among the worst, 1 for below average
responses, 2 for average responses, 3 for above average responses, and 4 for responses
judged to be among the best.

The Classroom and Facility Safety task had three parts. In the first part, the
teacher was asked to list categories of activities that teachers can do with their students at
the beginning of the year to promote laboratory safety for the term (excluding facilities). In
the second part, the teacher was again asked to list specific things to do to promote safety
and reduce the likelihood of accidents, but the context was that of a specific laboratory 
activity. In the third part, the teacher was given a diagram of a section of a science 
classroom and asked to identify safety hazards. Parts one through three were scored in a 
similar manner. Teachers received 1-2 points for each appropriate and distinct response 
they listed, depending on the specificity and/or depth with which the category was described. 
Unlike some of the other tasks which asked scorers to use their judgement to award one 
versus two points, specific criteria were provided to differentiate between the one-point and 
two-point responses. One point was deducted for each inappropriate response, with a 
maximum of two points deducted. 
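A minimal sketch of the scoring rule for one part of the Classroom and Facility Safety task follows. It assumes that the two-point limit on deductions applies separately to each part, which the description above does not state explicitly. The function name and the sample data are hypothetical.

```python
MAX_DEDUCTION = 2  # assumed to apply per task part


def score_safety_part(appropriate_points, inappropriate_count):
    """Score one part of the safety task.

    appropriate_points:  list of 1s and 2s, one entry per appropriate and
                         distinct response, as awarded by the scorer
    inappropriate_count: number of inappropriate responses listed
    """
    for points in appropriate_points:
        if points not in (1, 2):
            raise ValueError("each appropriate response earns 1 or 2 points")
    deduction = min(inappropriate_count, MAX_DEDUCTION)
    return sum(appropriate_points) - deduction


# Illustrative data only: three appropriate responses and three inappropriate
# ones; only two of the three possible deduction points are taken.
part_score = score_safety_part([1, 2, 2], inappropriate_count=3)
```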

This assessment was designed to produce a licensure decision in the most cost- 
effective yet reliable manner possible. The process of scoring by creating a set of proper 
responses and measuring how many the teacher identified (and allowing for original ones) 
captures how well a teacher does or does not do a designated task. If the ultimate set of 
tasks which constitute the assessment are deemed to represent a sufficiently broad sample 
of tasks that are critical to success in teaching, the test should be sufficient for purposes of 
licensure. However, since there is little information on the extent to which a teacher 
exhibits specific teaching competencies either within or across tasks, this assessment is less 
useful for yielding diagnostic information for staff development or beginning teacher support 
purposes. 

Scorers and Their Training 

Scorers were recruited mainly from the task development team. As a result, three 
out of the four scorers had participated in the development of the tasks. Their extent of 
participation ranged from conceptualization and review of materials to major development 
work on one of the tasks piloted. The scorers included two current science teachers and two 
district science specialists who worked with beginning teachers. 


Scorers were asked the degree of knowledge of science and of science teaching needed 
to accurately score the assessment. Their general consensus was that minimal knowledge 
was needed, although two of the four scorers specified a context of an experienced science 
teacher overseeing a small group of scorers. FWL staff believe that the degree of knowledge 
of science and science teaching needed for accurate scoring varies from task to task. Both of 
the two FWL observers of the scoring training were experienced teachers, but were not 
trained in science. They each found it easier to judge general pedagogical principles (e.g., 
built on previous instruction) than to judge aspects that were more content-related (e.g., the
appropriateness of the homework assignment in Lesson Planning). For many tasks, several
additions were made to the list of acceptable responses during the scoring of ten sample 
responses. A non-science teacher might not be able to recognize acceptable responses which 
were not on the original list. No data are available to estimate the frequency with which 
novel acceptable responses occurred. 

Training for scoring all tasks was similarly structured but conducted separately. To 
calibrate the scorers (i.e., make sure each was scoring similarly), the following process was
used: First, copies of the stimulus materials and scoring guide were distributed, and scorers 
read through them. The trainer then asked the scorers to score one teacher response. The 
response was then analyzed, point by point, and scoring of each part was discussed. When 
scorers disagreed with the trainer, the rationale underlying the scoring was discussed, and a 
decision was reached on how to score similar responses. Sometimes this involved a greater 
understanding of how to apply and/or refine the existing scoring criteria; sometimes this 
entailed adding a response category to the original set for which credit was to be given. 
This process was repeated until the responses of approximately ten teachers had been 
scored by the group. (The developer indicated that when he trained similar groups of 
scorers for statewide assessments, 50 responses were used for the calibration phase.) 
Scorers then evaluated teacher responses on their own. Each teacher was scored by two 
scorers. A trainer checked the ratings for each individual teacher. If scorers were two or 
more points apart in their total score for a task part, then they were asked to confer and 
resolve the scoring discrepancy within one point. 
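The discrepancy check performed by the trainer can be sketched as follows, assuming that each task part has one total score from each of the two scorers. The function names and the sample scores are hypothetical.

```python
def needs_conference(score_a, score_b, threshold=2):
    """Return True when two scorers' totals for a task part differ by
    two or more points, the condition under which they were asked to confer."""
    return abs(score_a - score_b) >= threshold


def flag_discrepancies(part_scores):
    """Return the task parts for which the two scorers must confer.

    part_scores: task part label -> (first scorer's total, second scorer's total)
    """
    return [part for part, (a, b) in part_scores.items() if needs_conference(a, b)]


# Illustrative data only: parts 1 and 3 differ by two or more points.
scores = {"part_1": (7, 5), "part_2": (4, 4), "part_3": (3, 6)}
to_resolve = flag_discrepancies(scores)
```

Under this rule a one-point difference between scorers is accepted without discussion.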

For one or two subparts of some tasks (Lesson Planning, Applying Effective
Instructional Techniques, and Classroom and Facility Safety), the original scoring criteria
were extensively revised due to problems in their implementation. Problems in the stimulus 
materials were identified in the course of scoring. These problems tended to be a lack of 
information that allowed teachers to make wrong assumptions or a need for more specificity 
in directions to more clearly indicate the focus of the desired response. Generally, these
problems were minor. The stimulus materials needing major revision were those for one part of
Applying Effective Instructional Techniques that focused on teaching students of diverse
cultures. The problems identified in the stimulus materials and scoring criteria suggest a 
need for more extensive pilot testing prior to actual administration. 

All scorers evaluated their training as "very good," the highest rating available. 
Three of the scorers specifically praised the calibration portion of the scoring training 
devoted to discussion of the application of the scoring criteria. The only suggestion for 
improving the training was to continue refining scoring criteria to reduce the time required 
for scoring each task. One scorer also suggested that requiring the candidates to write 
legibly in dark ink might reduce the eye strain which she experienced. 

The training of scorers exhibited many of the principles of good instruction. Scorers 
received a clear introduction to the task. Trainers monitored scorers' performance and 
adjusted instruction according to the results. Multiple examples were provided. The 
examples were randomly chosen, not chosen deliberately to illustrate different scoring 
decisions. For the most part, this worked well. 

The training would have been strengthened by the inclusion of more examples. In 
similar training for assessments in the legal profession, the trainer uses fifty sample 
responses for calibration instead of ten to ensure variability among the sample responses. 
FWL staff believe that more examples would have been especially helpful for cases where 
the scorer was required to choose between 1, 2 or 3 points for a single appropriate item. 

Assessment Content 

In the following pages, the content of the secondary life/general science teacher 
assessment is evaluated along these dimensions: 

• Congruence with the 1990 California Science Framework; 

• Extent of coverage of California Standards for Beginning Teachers; 

• Job-relatedness of the instrument;

• Appropriateness for beginning teachers; 

• Appropriateness across different teaching contexts (e.g., grade levels,
subject areas);

• Fairness across groups of teachers (e.g., ethnic groups, gender); and

• Appropriateness as a method of assessment.






As was true of all of the assessment instruments pilot tested this spring and summer, 
there was not sufficient time during development to conduct a larger content validity study. 
Without such a study, our ability to comment on the assessment's appropriateness along 
such dimensions as job-relatedness, appropriateness for beginning teachers, and 
appropriateness across contexts is limited. Thus, excluding the first two dimensions of 
curriculum congruence and standards coverage (which are based on FWL staff's analysis of
the documents involved), the discussions of the remaining dimensions are based on the
perspective of the participating teachers and scorers, and FWL staff, as reflected in feedback
forms, in informal conversations with the scorers and in analysis of the scores.

The discussion of the content begins with a comparison of the instrument with the 
preliminary edition of the 1990 Science Framework for California Public Schools, 
Kindergarten Through Grade Twelve. 

Congruence with California Model Curriculum Guides and Frameworks 

The California State Department of Education periodically produces subject-specific 
documents, curriculum guides and frameworks, which serve as public statements describing 
the curriculum which content and pedagogy experts believe is most appropriate for
California school children. The most recent document pertaining to science is the
preliminary edition of the Science Framework for California Public Schools, Kindergarten
Through Grade Twelve (California State Department of Education, 1990 - referred to in
this report as the Science Framework). The reader should note that this framework was in
development at the time of the development of the assessment, and therefore was not
available to assessment developers; nonetheless, as the current statement of expectations for
the California science curriculum, it will be utilized as a standard to which the assessment is
compared. 

The Science Framework is divided into three parts. Each part will be discussed 
separately, with a description of the main themes of each part followed by a discussion of 
whether or not the prototype tasks are consistent with the themes. FWL staff evaluations 
are summarized in Table 3.2. 

Part I of the Science Framework discusses general characteristics of science to be
emphasized in science classes, including the nature of scientific inquiry to be modeled and
the thematic organization of instruction across the curriculum. With respect to the nature
of scientific inquiry, two of the three lessons critiqued in the Secondary Life/General Science
Teacher Assessment's Lesson Planning task contain student practice in observation and
analysis of data together with understanding why observed results need not be perfectly
consistent with predicted results to support a theory. The rest of the tasks, with the
possible exception of Teacher as Curriculum Decision-Maker, address different aspects of
science instruction, and there is a consistent use of hands-on activities to illustrate concepts
being taught throughout the assessment.

TABLE 3.2

COVERAGE OF THE CALIFORNIA SCIENCE FRAMEWORK
BY THE SECONDARY LIFE/GENERAL SCIENCE TEACHER ASSESSMENT

Content                             Method of Coverage                    Extent of
                                                                          Coverage

Part I: What Is Science?

-Nature of scientific inquiry       -Reflected in activities portrayed    Full
                                     in all tasks.

-Thematic structuring of content    -Not explicitly addressed in any      None
                                     task.

Part II: Content of Science

-Physical Sciences                  -Not represented.                     None

-Earth Sciences                     -One topic used for a partially       Limited
                                     developed task.

-Life Sciences                      -Two topics used among all            Partial
                                     tasks.

Part III: Achieving the Desired
Curriculum

-Thinking processes emphasized      -Application reflected in the         Partial
                                     partially developed task on
                                     student misconceptions and in
                                     Parent/Student Letter.

-Level-specific guides              -Most high school level               Limited
                                     goals addressed by aspects
                                     of 1-2 tasks. No representation
                                     of middle school curriculum.

-Teaching science to historically   -Some attention in Applying           Limited
 underrepresented students           Effective Instructional
                                     Techniques. No representation
                                     of LEP students in any task.

The emphasis on thematic structuring of content in the Science Framework 
addresses the use of themes both across courses and within a course. With respect to the 
former, this specific kind of articulation of course content both with previous science courses
at earlier grade levels and with other science courses at the same grade level would be 
difficult to illustrate within a unit, much less through a single lesson. It is also questionable 
if all beginning teachers could be expected to articulate course content at such a grand scale. 
However, the thematic structuring within a course could be reflected in the present tasks 
with slight modifications of the contextual information and stimulus materials. Information 
on the theme(s) emphasized could be included in the contextual information for all tasks 
and the extent to which instruction reflects the given theme(s) could be incorporated into 
the scoring criteria for Lesson Planning and Teacher as Curriculum Decision-Maker. In
addition, the current tasks all focus on high school classes. Since high school classes tend to 
be more specialized, forms of tasks which address middle school classes might exhibit the 
thematic emphasis more clearly. 

Part II discusses specific content to be presented at different grade levels and how it 
might vary in presentation according to the themes emphasized. The topics chosen for 
representation in the assessment could be more diverse. The tasks pilot tested focus on 
only two of the fifteen topics described under curriculum content, both within the life 
science curriculum: "Living Things" and "Cells, Genetics and Evolution." (An
overrepresentation of topics from the life sciences is to be expected, as the assessment 
covers both life science and general science; physical science teachers would take a separate 
assessment.) In addition, one partially developed task involved classification of minerals, 
part of "Geology and Natural Resources". 

Some of the content of lessons portrayed, e.g., the lesson in Applying Effective 
Instructional Techniques, seems to be more characteristic of the instruction called for at 
earlier grade levels. One complication is that the content called for in the Science 
Framework represents a model to work toward and does not reflect the content presently 
taught at specific grade levels, especially at the elementary level. Thus, until the content 
taught is more in alignment with the State Science Framework, a policy decision may be
needed as to the extent to which a secondary science assessment should reflect either the
grade-level content in the Science Framework or the content typically taught at that level.
Whatever the ultimate decision, the present task shells could easily be modified to reflect a
greater variety in representation of content as well as the content deemed appropriate for
secondary science.

Part III discusses implementation of the desired curriculum. While the chapters
discussing programmatic implementation at the school district and site level and criteria for 
the adoption of instructional materials are distant from the responsibilities of typical 
beginning teachers, the chapter on "Science Processes and the Teaching of Science" describes 
desirable characteristics of science instruction which are applicable to the classroom. These 
characteristics include thinking processes to be emphasized, guides for science programs 
across grade levels, and an emphasis on teaching science to the historically 
underrepresented (females, most minority groups, and the disabled) and Limited-English- 
Proficient students. 

The Science Framework calls for an emphasis on the thinking processes of observing, 
communicating, comparing, ordering, categorizing, relating, inferring, and applying. Not
all thinking processes are to be taught in all grades, however, as theories of child
development suggest that young children develop these skills sequentially, and roughly in
the order listed. While all antecedent skills are reinforced and refined at all levels, inferring 
is to be introduced and stressed in grades six through nine, and applying in grades nine 
through twelve. 

As all tasks concentrated on high school classes, the skill of applying is most 
pertinent. The Science Framework discusses this skill in the context of learning to use 
scientific knowledge to think about current problems. The only task in which this is 
specifically done is the task examining students' scientific misconceptions, where the 
students discuss a current problem in light of the science which they have just learned. In 
addition, scoring criteria for the task Parent/Student Letter imply that students will learn
to apply their scientific knowledge to current problems.

Activities portrayed in the tasks pilot tested included several instances where the
students were required to use the antecedent skills, particularly observing when conducting
experiments in Lesson Planning or Teacher as Curriculum Decision-Maker, comparing as
when two types of cells are contrasted in Teacher as Curriculum Decision-Maker, and
inferring during activities analyzed as part of Lesson Planning. Task shells could easily be
revised to include a focus on reinforcing and/or developing various types of thinking skills,
and a variety of thinking processes could be represented across the set of tasks which
compose an assessment.

The Science Framework provides level-specific guides for science programs. For 
secondary science instruction, both middle school level and high school level guides are 
provided. However, no middle school level courses were portrayed in the current forms of
tasks developed. The goals emphasized for high school science programs in the Science
Framework are shown in the following bulleted paragraphs in italics, followed by a 
description of the extent to which the assessment reflects each goal. 

• Build on a solid foundation of science instruction in kindergarten through grade
eight. At present, elementary schools are in various stages of aligning the 
content of their science instruction with previous Science Frameworks. However, 
most elementary schools do not yet teach all the content described in the 1990 
Science Framework, so content previously taught in elementary grades would 
need to be specified in any task which addressed this aspect of science 
instruction. At present, no tasks specify the science content in the 
elementary curriculum which teachers should assume the students 
experienced. 

• Lead in a coherent fashion to greater opportunities for all students. This goal 
stems from a desire to make science comprehensible to a wider range of students,
especially students whose limited mathematical experience may have prevented 
them from meeting prerequisites for science courses. It also calls for more 
integration of the science curriculum and less discipline-oriented courses which 
emphasize the common foundation of basic principles of physics, chemistry, and 
biology. 

The collection of tasks pilot tested portrayed a number of different classrooms; 
however, most of the tasks focus on students who do not plan to go to college, a 
group which in the past would have been less likely to take science courses than 
college-bound students for whom science courses have been required. One of the 
tasks pilot tested presumes that the students in the classes for which the teacher
is performing the task are evenly split between college-bound and non-college-bound
students; three tasks focus on classrooms of non-college-bound students;
and the remaining task focuses on a classroom of mostly college-bound students.
Both Lesson Planning and Teacher as Curriculum Decision-Maker contain scoring
criteria which focus on the teacher's ability to recognize specific aspects of the
lessons or activities which are especially appropriate or inappropriate
for most non-college-bound students. Teachers reported difficulty with
this aspect of the assessment; this will be described in detail later in
this section of the chapter.

With respect to integration of the science curriculum, all the tasks focus on 
classes at the high school level, which would be expected to be more specialized 
than those at other grade levels. The Science Framework cites examples of 
integration such as a biology class examining the physics of motion and the 
concept of work and machines when discussing bones and muscles. FWL staff 
finds no such instances when other scientific disciplines are integrated in the 
tasks that were pilot tested. 

• Help students understand the nature of science - in particular, its experimental,
nondogmatic nature and the methods by which progress is made. This is the
nature of scientific inquiry previously discussed with respect to Part I of the
Science Framework. The activities portrayed in the tasks and the
scoring criteria are consistent with this emphasis.

• Develop in students a strong sense of the interrelationship between science and 
technology and an understanding of the responsibility of scientists and 
scientifically literate individuals to both present and future societies. No task 
pilot tested specifically reflects this goal. One of the scoring criteria for a 
partially developed task addressing using computers as tools in science education 
is whether or not the teacher recognizes that a weakness of the lesson portrayed 
is the missed opportunity to link the lesson to career options and relevant real- 
world uses of data bases. 

• Foster each student's ability to act as an independent investigator and thinker
rather than a "recipe follower." Lesson Planning explicitly includes this as one of
its scoring criteria for the lesson developed by the candidate, i.e., that the lesson
is not merely rote learning. Many other criteria for judging the analysis of
lessons provided in Lesson Planning are focused on developing this ability in
students. A few examples of relevant scoring criteria include recognizing the
improvement of students' problem solving skills, practice in data collection and
analysis, the manipulation of materials, and the demonstration of several
scientific principles (e.g., that more samples lead to more valid results) as
strengths and not involving enough students directly in the activity
and asking too many similar questions in the activity as weaknesses.

• Reinforce basic tools of language and mathematical communication. This goal 
calls for more integration across subjects so that students receive reinforcement 
for writing and mathematical skills in classes other than English and 
mathematics, and practice writing and mathematical problem-solving with topics 
which would normally be found in other classes, such as science. Although 
instances where the student communicated in writing or needed to use certain 
mathematical skills were portrayed in the tasks pilot tested, the specific emphasis 
portrayed in the Science Framework was not specifically reflected in any task 
pilot tested. 

• Provide an expanded view of science-related careers. This was not addressed in
any of the tasks pilot tested. However, one of the scoring criteria for a partially
developed task addressing using computers as tools in science education is
whether or not the teacher recognizes that a weakness of the lesson
portrayed is the missed opportunity to link the lesson to career
options.

There was some attention to historically underrepresented students in the task of 
Applying Effective Instructional Techniques, mainly in equitable and non-racist instruction
and encouragement of students. The Science Framework also suggests providing diverse 
role models, providing extracurricular enrichment opportunities, building parent 
involvement and peer recognition programs, and building on prior student knowledge to 
either draw on or augment student background knowledge. 

Instruction of Limited-English-Proficient students was not addressed in any of the
tasks.

While the present collection of tasks only partially covers the specific emphases in
the latest Science Framework, none of it contradicts the framework. Task shells
could easily be modified to cover a larger portion of the Science Framework.






Extent of Coverage of California Standards for Beginning Teachers



The California Beginning Teacher Standards are criteria for teacher competence and 
performance which the Commission on Teacher Credentialing expects graduates of
California teacher preparation programs to meet. Listed below are brief italicized
descriptions of Standards 22 through 32 which pertain to expectations of student
competencies to be attained prior to graduation from teacher preparation programs. (The 
remaining standards address programmatic requirements.) To evaluate this assessment 
instrument and make inferences about the assessment approach which it represents in 
terms of the appropriateness for use with California secondary life and general science 
teachers, the stimulus materials and scoring criteria for each task were compared with the 
11 California Beginning Teacher Standards. Each standard will be discussed separately. 

Standard 22: Student Rapport and Classroom Environment. Each candidate
establishes and sustains a level of student rapport and a classroom environment that 
promotes learning and equity, and that fosters mutual respect among the persons in a class. 
Although this was not measured directly, in Applying Effective Instructional Techniques, 
teachers need to identify appropriate and inappropriate teacher interactions with students. 
These include both appropriate and inappropriate responses to students, appropriate and 
inappropriate use of discipline, and instances of inequitable treatment of students and 
racially insensitive remarks. Because this skill includes a teacher's interpersonal and group 
management skills, it is difficult to simulate through a transcript. Its complete
measurement probably relies upon direct observation. 

Standard 23: Curricular and Instructional Planning Skills. Each candidate prepares 
at least one unit plan and several lesson plans that include goals, objectives, strategies, 
activities, materials and assessment plans that are well defined and coordinated with each 
other. These skills are at the heart of the Teacher as Curriculum Decision-Maker exercise
where a teacher needs to be able to choose and sequence activities into a unit of instruction
which meets the learning objectives, includes differing approaches to learning, and uses
classroom time efficiently. The teacher also chooses an appropriate assessment activity as
part of this exercise. The Lesson Planning task also requires these skills. Even though
teachers only evaluate single lessons, the evaluation includes whether each lesson
contributes toward meeting the unit objectives, correctly presents content which is
appropriately sequenced in relation to previous and subsequent lesson topics, and includes
appropriate activities for the grade level and achievement level of the students. The aspects
of curricular and instructional planning mentioned in the standard are important
contributors to a teacher's score on these two tasks, Teacher as Curriculum Decision-Maker
and Lesson Planning.

Standard 24: Diverse and Appropriate Teaching. Each candidate prepares and uses 
instructional strategies, activities and materials that are appropriate for students with 
diverse needs, interests and learning styles. Both the Teacher as Curriculum Decision-
Maker and Lesson Planning tasks evaluate whether the activities and materials chosen by
the teacher are appropriate to the specific group of students described in the contextual 
information provided. Moreover, in Teacher as Curriculum Decision-Maker, teachers are 
evaluated on whether they include a variety of activities among those provided. This variety 
could be constructed to represent diverse learning styles. In the introductory material for 
each task, students are described in terms of their grade level and sometimes plans for 
education beyond high school, but not in terms of interests or learning styles. It would be 
possible to slightly revise the tasks mentioned to more completely address this standard by
including student interests and learning styles in the contextual information provided, 
revising the activities provided to the teacher to include both appropriate and inappropriate 
activities given the interests and learning styles described, and adding scoring criteria which 
evaluate the match between chosen activities and the students described. 

Standard 25: Student Motivation, Involvement and Conduct. Each candidate
motivates and sustains student interest, involvement and appropriate conduct equitably 
during a variety of class activities. In the task Applying Effective Instructional Techniques, 
teachers are asked in several instances to identify where a teacher uses appropriate or 
inappropriate techniques to motivate, involve, or discipline students. However, responding 
to a transcript of classroom interactions only captures limited features of the complex task 
of motivating and equitably sustaining student interest, involvement and appropriate 
conduct. 

Standard 26: Presentation Skills. Each candidate communicates effectively by 
presenting ideas and instructions clearly and meaningfully to students. The Applying 
Effective Instructional Techniques task contains several instances in which teachers need to
identify either appropriate or inappropriate representations of concepts or structuring of the
lesson. In Lesson Planning, teachers are asked to describe a lesson that they would teach.
Part of the criteria by which this lesson is judged is the extent to which the concepts relate 
to objectives, are appropriately sequenced from easy to more complex, are reflected in 
classroom activities and homework, and are scientifically correct. A task which was not 
fully developed addressed student misconceptions of scientific concepts, and could easily 
contain a component asking the teachers to describe how they would explain a key concept.
Any aspect of this standard which addresses such performance aspects of presentation as
whether a teacher speaks loudly and clearly enough to be understood, however, would be 
difficult to capture with pencil and paper tests such as the Secondary Life/ General Science 
Assessment. 

Standard 27: Student Diagnosis, Achievement and Evaluation. Each candidate 
identifies students' prior attainments, achieves significant instructional objectives, and 
evaluates the achievements of the students in a class. Of the tasks piloted, structuring of 
lessons to achieve significant instructional objectives was addressed by Teacher as 
Curriculum Decision-Maker and Lesson Planning. The Teacher as Curriculum Decision-
Maker task includes selection of an appropriate summative evaluation instrument for the
class described. 

None of the tasks pilot tested address student diagnosis or evaluation. However, two 
other partially developed tasks address this standard. Student diagnosis is the focus of a
task which addresses common student misconceptions about phenomena which conflict
with accepted scientific theories. The task requires teachers to evaluate the strengths and 
weaknesses of instruction embodied in a script of teacher/student interaction which exhibits 
student misconceptions. Teachers diagnose or evaluate the student responses, analyze two 
possible scenarios of remedial instruction, and outline their own lesson(s) to correct the 
student errors exhibited. 

Another task focuses on the evaluation of student performance. The task requires a 
candidate to (1) determine if a set of instructions is clear or what modifications need to be
made to make the assignment clear; (2) determine if responses by another grader are 
accurate and appropriate; and (3) list common factual errors in a group of student papers 
and design a lesson to correct those misconceptions. 

Standard 28: Cognitive Outcomes of Teaching. Each candidate improves the ability 
of students in a class to evaluate information, think analytically, and reach sound
conclusions. Even though cognitive outcomes are not addressed, the evaluation of lessons 
designed by the candidate in Lesson Planning includes whether each lesson contributes 
toward meeting the unit objectives, correctly presents content which is appropriately 
sequenced in relation to previous and subsequent lesson topics, and includes appropriate 
activities for the grade level and achievement level of the students. The incompletely 
developed task which requires teachers to diagnose students' scientific misconceptions also 
measures their ability to recognize effective remedial strategies which would foster the
ability of all students to evaluate information, think analytically, and reach sound
conclusions.

Standard 29: Affective Outcomes of Teaching. Each candidate fosters positive
student attitudes toward the subjects learned, the students themselves, and their capacity to 
become independent learners. Applying Effective Instructional Techniques contained 
several instances where a teacher needed to identify appropriate or inappropriate actions 
which would affect student attitudes toward science and/or other students. Lesson Planning 
addressed some aspects of fostering a student's capacity to become an independent learner, 
i.e., sequencing concepts within the lesson from easy to more complex and assigning 
homework of appropriate difficulty which is not rote. 

Standard 30: Capacity to Teach Cross-culturally. Each candidate demonstrates 
compatibility with, and ability to teach, students who are different from the candidate. The 
differences between students and the candidate should include ethnic, cultural, gender, 
linguistic and socioeconomic differences. One segment of Applying Effective Instructional 
Techniques contains several inappropriate teacher remarks to students which are ethnically 
or culturally insensitive. No other issues of diversity listed in the standard, i.e., gender, 
linguistic, socioeconomic, were explored. 

Standard 31: Readiness for Diverse Responsibilities. Each candidate teaches
students of diverse ages and abilities, and assumes the responsibilities of full-time teachers. 
Although the shells for the task include classroom context as one of the variables to be 
manipulated, the set of tasks piloted focus on high school (mostly tenth grade) classes. 
There is some diversity in terms of ability levels of students, and teacher responses need to 
take ability level into account in most of the exercises. There is no reason why the ability 
levels and school contexts could not be varied more widely if the assessment approach were 
used for credentialing. 

Standard 32: Professional Obligations. Each candidate adheres to high standards of 
professional conduct, cooperates effectively with other adults in the school community, and 
develops professionally through self-assessment and collegial interaction with other
members of the profession. Although the Parent/Student Letter addresses communication
with adults (i.e., parents), it focuses on one small aspect of cooperation with parents, 
informing them of the learning objectives and content of a particular science course, and 
general classroom policies. The scoring criteria for the letter focus on the topics included in 
the letter, and not on whether the letter is comprehensible to the parent population. 
Moreover, the language used in the example in the instructions would be difficult to
comprehend for most adults who are unfamiliar with the technical terms used, much less for
Limited-English-Proficient parents:



A topic might be "lab safety" and an example would be: 
"student will wear approved protective goggles when doing 
laboratory experiments where a splash hazard exists." 

The only other task depicting other adults in the school community is Classroom and 
Facility Safety, where teachers draft a memo concerning a specific safety issue to be
distributed to all science teachers in the district. 

Table 3.3 lists the standards and FWL staff's evaluations of the extent to which the 
assessment methodology covers each standard, based on reviews of the fully and partially 
developed task shells together with a consideration of alternative tasks that might easily be 
developed. A "full" rating indicates that multiple dimensions of the standard impact a
teacher rating, even if these dimensions are not scored separately. A "partial" rating 
indicates that some dimensions affect the rating, but some important dimensions are 
unexamined. A "limited" rating indicates that some dimensions affect the rating, but most 
important dimensions are unexamined. 

The previous two sections have addressed the congruence of the assessment with 
state standards. To address other aspects of the content, teachers and scorers completed 
surveys soliciting their perceptions of the appropriateness of the assessment along a number 
of dimensions: job-relatedness, appropriateness for beginning teachers, appropriateness 
across contexts, fairness across groups of teachers, and, finally, a general evaluation of the 
appropriateness of this method of assessment. Their perceptions, together with data on 
teacher performance, are summarized in the remainder of this section. 

Job-Relatedness 

Both teachers and scorers were asked whether the tasks chosen were relevant to the 
job of teaching life/general science at the secondary level. 

Teacher perceptions. Teachers agreed that the tasks were relevant to their job of 
secondary science teacher. Approximately 92% (12 of 13; the relevant page was missing 
from the survey form for 19 teachers) of the teachers responding to the question who 
completed form A and 76% (25 of 33) of those completing form B responded affirmatively. 



TABLE 3.3

EXTENT OF COVERAGE BY THE SECONDARY LIFE/GENERAL SCIENCE
TEACHER ASSESSMENT OF CALIFORNIA STANDARDS FOR BEGINNING TEACHERS

Standard                              Tasks Addressing Standards        Extent of Coverage

22: Student Rapport and Classroom     -Applying Effective               Partial
    Environment                        Instructional Techniques

23: Curricular and Instructional      -Tchr as Curric Decision-Maker    Full
    Planning Skills                   -Lesson Planning

24: Diverse and Appropriate           -Tchr as Curric Decision-Maker    Partial
    Teaching                          -Lesson Planning

25: Student Motivation,               -Applying Effective               Limited
    Involvement and Conduct            Instructional Techniques

26: Presentation Skills               -Applying Effective               Partial
                                       Instructional Techniques
                                      -Lesson Planning

27: Student Diagnosis,                -Tchr as Curric Decision-Maker    Full
    Achievement and Evaluation        -Lesson Planning
                                      -Student Misconceptions
                                      -Evaluation of Student
                                       Performance

28: Cognitive Outcomes of Teaching    -Lesson Planning                  Full
                                      -Student Misconceptions

29: Affective Outcomes of Teaching    -Applying Effective               Limited
                                       Instructional Techniques
                                      -Lesson Planning

30: Capacity to Teach                 -Applying Effective               Limited
    Crossculturally                    Instructional Techniques

31: Readiness for Diverse             -Partial                          Partial
    Responsibilities

32: Professional Obligations          -None                             Limited



Few teachers elaborated on their response; those teachers who singled out a single task for 
praise mentioned Classroom and Facility Safety. The positive responses included: 



I really enjoyed the tasks. They were very appropriate. 

Subject content: Not more specifically, because I teach in a 
middle school, and the time allotment is much less for each 
topic. Everything else: yes. 

Of those teachers who did not believe that the tasks were relevant, their criticism 
mostly addressed the content embedded in the tasks, and not the tasks themselves, as 
illustrated by the following comments: 

I felt the lesson planning section, the examples, were not 
appropriate for the particular grade level I teach. It should be 
more grade oriented. 

I will touch only briefly in genetics and if it is covered in tenth 
grade classes that is fine, but I teach 7 grade Life Science and it 
does not get into that much depth. 

At my school site, we have a safety coordinator who is in charge 
of storing chemicals, etc. Therefore I have no contact with 
most of the chemicals displayed. 

Some teachers, however, did question the ability of the specific tasks to measure 
their teaching skills, as illustrated in the following quote: 

Yes! The tasks are relevant...I do employ instructional 
techniques...I do lesson planning... I maintain classroom safety. 
However...This test does little to accurately assess my skills in 
these areas (Except for the Classroom Safety - Section 3B - it 
is fine). 

Scorer perceptions. All four scorers agreed that the tasks were relevant for 
secondary science teachers. One scorer expressed concern about the ability of non-inner city 
teachers to catch the subtlety of the dialogue in Applying Effective Instructional Techniques. 
Another scorer who participated in designing the assessment thought that the assessment 
could serve as a useful tool for shaping the curriculum in teacher training institutions: "If 
weaknesses in candidate responses are throughout the assessment, then the instrument 
should drive the programs in universities and districts." 

Generally, both the scorers and the teachers believed that the tasks chosen were 
relevant to the job of secondary science teachers. For the most part, those who disagreed 
believed that the content contained in the task, and not the task itself, was inappropriate. 

Appropriateness for Beginning Teachers 

Because the focus of this assessment is on beginning teachers, who are still in the 
initial stage of professional development, one issue pursued was whether or not the tasks 
were too easy or too difficult. 

Perceptions. When asked whether they had "sufficient opportunity to acquire the 
knowledge and abilities needed to respond in a reasonable manner to the assessment 
questions," 92% (12 of 13) of the teachers responding to the question with respect to Form 
A and 70% (23 of 33) of the teachers completing Form B responded affirmatively. Some 
teachers who believed themselves well prepared went on to emphasize that their experience 
teaching played a critical role in their preparation: 

As a second year teacher, I feel that I have developed some of 
the skills needed to take this test. But I have done so only 
because I have survived two very stressful years in teaching. 

Yes, but only because I have experienced a full load of teaching 
and worked on some of the frustrations, problems, etc. for at 
least one year. I do not think I would have been as well 
prepared after my student teaching because (1) no specific 
attention was given to safety; (2) my students for student 
teaching were very good, motivated, etc. I may not have been 
as aware of the need to build esteems and motivate if I had not 
taught in two other classrooms after student teaching. 

Teachers who did not feel well-prepared gave a number of explanations. Some 
teachers mentioned the topic: 



I felt very uncomfortable trying to write a lesson on genetics 
when I have never taught genetics. Could this be a fair 
evaluation of my competency? 

I was a geology major! My biology background is very weak. 

Some mentioned a perceived need for more or more varied classroom experience: 

I believe a lot of the assessment questions require being in the 
classroom for a few years and gaining the experience to know 
what is appropriate and what is not. 

Knowing about weaknesses and strengths in lesson planning 
comes with a lot of experience and trial and error. 

Abilities, yes. Not necessarily all of the knowledge - a lot of 
that you pick up from the specific course(s) you teach. 

Finally, some criticized the tasks themselves: 

Many unanticipated and unexpected things happen in a 
classroom and they are difficult to anticipate and, therefore, 
difficult to assess methods to prevent or eliminate them. 

I don't think there is a reasonable response to the assessment 
questions. They are very poor. The only assessment questions 
that were reasonable were the safety ones. 

But the questions, being out of context, are absurd. 

When the teachers were asked if they found any tasks or parts of tasks too difficult, 
38% (12 of 32) of those completing Form A and 42% (14 of 33) of those completing Form B 
replied, "Yes." Nine teachers identified Teacher as Curriculum Decision-Maker for reasons 
such as the following: 

There appear to be too many objectives to cover in a two week 
period in the Teacher as Curriculum Decision-Maker exercise. 

It was more of a jigsaw puzzle - moving pieces of paper around, 
trying to add up to 50 minutes. I am much more flexible in my 
planning. 

Using only those activities listed to develop a two-week 
curriculum. For example, I wanted to give a quiz after the first 
week, but I did not like the specific content of the only quiz 
listed - in terms of correlating with my first week's content. 

All were difficult from the point of view that most of the 
situations are out of context. I.e., [SES] of class, type of school, 
climate of classroom and hundreds of other variables. Although 
some of the situations seemed cut and dry, others were 
subjective relating to the above variables. 

For Form B, 13 of the 33 teachers identified Lesson Planning, 3 identified Applying Effective 
Instructional Techniques, and 2 identified Classroom and Facility Safety. Teachers did not typically 
give lengthy explanations of their choice. The teachers selecting Lesson Planning 
mentioned: (1) the topic, as in "In 7th grade we do not go into a three week course in 
genetics;" (2) difficulty in understanding the kind of response expected, as in "Lesson 
planning wasn't difficult, but I didn't know how much to write, how thorough to be;" and 
(3) technical problems, as in "Having to come up with a lesson plan without having 
knowledge of what was specifically taught previously." The few teachers who found 
Applying Effective Instructional Techniques difficult either felt that it was too "taxing" to 
identify and explain both appropriate and inappropriate actions or felt that it was difficult 
to evaluate out of context, preferring to see either a real teacher or a video. The teachers 
choosing Classroom and Facility Safety found critiquing the drawing depicting the storage 
of chemicals difficult. 

The scorers evaluated the difficulty of the tasks from a different perspective, having 
seen both the expected answers and the teacher responses. According to the scorers, 
teachers had problems with three tasks in particular: Parent/Student Letter, Classroom and 
Facility Safety, and Lesson Planning. Many teachers seemed to have no experience in 
writing parent/student letters, exhibiting a lack of ideas of what might be included. With 
regard to safety, one scorer remarked, "Many teachers seemed unprepared to deal with the 
classroom situation and the storage area. This is so important, it should be on all formats 
and carry weighted points. Once the teachers and prep program advisors know there is 
accountability, the performance level will improve." Both scorers of Lesson Planning 
commented on the lack of expertise exhibited by the teachers, described by one scorer as 
"abbreviated, shallow candidate answers resulting in very low scores." This scorer went on 
to suggest that perhaps candidates felt rushed, yet only one teacher completing this task 
reported a need for more time while 1/3 of the teachers (11 out of 33) identified it as being 
too difficult. 

While the teachers for the most part believed that they had an opportunity to 
develop the skills required of the tasks, the scorers disagreed. On numerous occasions 
during the scoring, various scorers commented on what they perceived as the inadequacy of 
current teacher training programs based both on their experience with new teachers and the 
answers being scored. The scorers expressed hope that an assessment such as this one 
would provide guidance for curriculum development and feedback on the performance of the 
graduates of teacher preparation programs. 

Performance on assessment tasks. Table 3.4 shows a statistical portrait of teacher 
performance on the assessment tasks. (Both the scores and the number of possible points 
were doubled, as scores were formed by adding the scores of the two scorers.) Teacher 
scores suggested that the content was difficult for teachers. Teachers as a group did best on 
Teacher as Curriculum Decision-Maker and Parent/Student Letter. The lowest scores were 
recorded for the portions of Lesson Planning that required teachers to analyze another 
teacher's lesson plan. 
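
The comparison across tasks can be made concrete by expressing each mean as a share of the points possible. The following is a minimal sketch, not part of the original analysis: it uses a subset of the means and points possible printed in Table 3.4, and the names `ROWS`, `percent_of_possible`, and `ranked` are hypothetical, introduced here only for illustration.

```python
# Means and points possible copied from a subset of rows in Table 3.4.
# Both columns are on the doubled scale (sum of the two scorers' ratings).
ROWS = [
    # (form, task, subpart, mean, points_possible)
    ("A", "Teacher as Curriculum Decision-Maker", "Part I", 113.9, 200),
    ("A", "Parent/Student Letter", "Part I", 9.8, 16),
    ("A", "Parent/Student Letter", "Part II", 17.7, 30),
    ("B", "Lesson Planning", "Part I: Lesson A", 4.8, 30),
    ("B", "Lesson Planning", "Part I: Lesson B", 4.5, 26),
    ("B", "Lesson Planning", "Part I: Lesson C", 2.2, 14),
]


def percent_of_possible(mean, points_possible):
    """Express a mean score as a percentage of the points possible."""
    return 100.0 * mean / points_possible


def ranked(rows):
    """Return (label, percent) pairs sorted from highest to lowest percent."""
    scored = [
        (f"Form {form}, {task}, {subpart}", percent_of_possible(mean, possible))
        for form, task, subpart, mean, possible in rows
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for label, pct in ranked(ROWS):
        print(f"{pct:5.1f}%  {label}")
```

Because the doubling applies to both the scores and the points possible, the percentages are the same as they would be on the single-scorer scale.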

To do well on a task, a teacher had to pay attention to many simultaneous factors 
and attend to most of the cues provided, e.g., lesson objectives, classroom composition. 
While the multiplicity of cues reflects the complexity of classrooms, the burden is on the 
teacher to process a great deal of information and to place the same significance on certain 
cues as the assessment developers, who were recognized experts on science teaching. 

TABLE 3.4

TEACHER PERFORMANCE, BY SUBPART, ON THE SECONDARY
LIFE/GENERAL SCIENCE TEACHER ASSESSMENT

                                             Teacher Scores*
                                                     Standard             Points
Subpart                                      Mean    Deviation     N      Possible

Form A

Task 1: Applying Effective
        Instructional Techniques
    Segment 1                                 4.8       3.0        32        18
    Segment 2                                 7.3       3.3        32        20
    Segment 3                                 7.9       2.7        32        14
    Segment 4                                 7.6       3.1        32        14

Task 2: Teacher as Curriculum
        Decision-Maker
    Part I                                  113.9      20.9        32       200
    Part II                                  17.8       8.5        32        33

Task 3: Parent/Student Letter
    Part I                                    9.8       4.1        32        16
    Part II                                  17.7      10.9        32        30

Form B

Task 1: Applying Effective
        Instructional Techniques
    Segment 1                                 5.3       3.3        33        18
    Segment 2                                 7.7       3.3        33        20
    Segment 3                                 4.0       3.7        33        14
    Segment 4                                15.7       5.6        33    (illegible)

Task 2: Lesson Planning
    Part I: Lesson A                          4.8       2.4        33        30
    Part I: Lesson B                          4.5       3.3        33        26
    Part I: Lesson C                          2.2       1.6        33        14
    Part II: Organization                     4.0       1.9        33         8
    Part II: Content                          3.2       1.5        33         8

Task 3: Classroom/Facility Safety
    Part I                                   13.1       5.3        33        28
    Part II                                  10.2       3.9        33        42
    Part III                                 15.2       6.2        32        52

* Since each teacher response was double-scored, the scores were derived by
  summing the two ratings.

Teachers reported difficulties when faced with unfamiliar situations. This appeared 
across tasks, as when teachers reported unfamiliarity with the topic in Lesson Planning, 
when the teaching approach they commonly used was not congruent with the activities 
provided in the Teacher as Curriculum Decision-Maker task, or when they had never seen, let 
alone written, a parent/student letter. Teachers also seemed to have trouble designing 
instruction for a group of students who were different from their own. This was reflected 
not only in relatively low scores and what one scorer called "shallow" responses, but also in 
a common criticism of the activities in the tasks as inappropriate for their own students, 
despite clear directions that they were to plan for a different group of students. The 
scorers/assessment developers were confident that new teachers could be trained to design 
instruction for different groups of students, but research has identified this as an area 
which distinguishes new teachers from expert teachers (Leinhardt, 1983; Wilson, 1988), 
suggesting that this may be a skill that develops later in a teaching career. New teachers 
generally have experience with a very limited range of students: those whom they taught 
during student teaching and those taught during one or two years as a regular teacher. Can 
they effectively build on the more in-depth experience as a full-time teacher to think about 
what may be appropriate for different types of students, or are they so caught up in 
classroom management, time management, and lesson planning that the issue of tailoring 
instruction develops later? This question cannot be answered by any data we have, but is 
key to ascertaining the appropriateness of the assessment for beginning teachers. 

Appropriateness across Contexts 

The assessment is designed so that the teaching context can be varied. The tasks 
piloted were very homogeneous with respect to grade level. All tasks focussed on high 
school classes consisting mainly of tenth graders. However, 54% of the teachers taking the 
test taught students who had not yet reached the ninth grade; two of these teachers taught 
sixth grade in a middle school with their single subject science credential. 

The tasks were more heterogeneous with respect to students. One task, Teacher as 
Curriculum Decision-Maker, featured students who were not planning to attend college. 
These students were described in the following way: "Though students read at grade level, 
they are not used to having extensive assignments and many do not even complete minimal 
homework." Classroom and Facility Safety and Parent/Student Letter featured a class of 
students who were non-college bound, ranging from ninth to twelfth grade, though 
containing mostly tenth graders. Evaluating Effective Instructional Techniques focussed on 
a class fulfilling a college entrance requirement, but did not specify the kinds of students in 
the focal classroom. Lesson Planning portrayed a class where half the students planned to 
attend college, and half did not. Classroom and Facility Safety did not specify the type of 
students, although the class itself is described as non-college preparatory. 

Teachers were asked their perceptions of the appropriateness across contexts on two 
dimensions: with respect to teachers at different grade levels, and with respect to teachers 
of diverse types of students. 

Grade level. Teacher perceptions of the appropriateness of the assessment across 
contexts differed according to the form completed. Teachers completing Form A 
overwhelmingly (75% or 24 of 32 teachers) agreed that the assessment was appropriate 
across grade levels. Only 48% (16 of 33) of the teachers completing Form B, however, 
agreed. Dissenting teachers completing either form, however, tended to agree that they saw 
the assessment as problematic for junior high and middle school teachers, as exemplified by 
the following comments: 

I think junior high teachers would have a hard time with 
appropriate high school level activities which test asks for. 

Lesson Planning was too in depth for 7th grade advance ESL 
science class. We have to break down the lesson step-by-step 
due to the language problem that sometimes occurs. 

Teachers teaching junior high school have more to deal with as 
far as student achievement levels. 

The materials given for the unit planning are too advanced for 
junior high students. Although, teachers are supposed to be 
able to cover all grade levels even though they may teach at a 
different level from that which they were assessed. 

Some junior high teachers in response to earlier questions also remarked that they 
did not cover topics in the depth required by the Lesson Planning and the Teacher as 
Curriculum Decision-Maker tasks. 

Scorers did not comment, either positively or negatively, on the grade-level aspects of 
the instrument's appropriateness. 

Diverse students. Teachers were asked whether they felt that the assessment was 
"appropriate for science teachers of diverse student groups (e.g., different student ability 
levels, different ethnic groups, handicapped or Limited-English-Proficient students, different 
school/community settings)." Sixty-nine percent (22 of 32) of the teachers completing form A 
and 56% (19 of 33) of the teachers completing form B believed that it was appropriate. 
These teachers supported their response with comments such as the following: 



It is still teaching. 

In all but extreme cases. 

However, more diverse examples could be used. 

Teachers who disagreed generally cited types of students who were not represented 
in the stimulus materials or the generalizability of the settings used. 

Parts I and II both were biology. Many students do not take 
Biology. 

The lesson plans and self directed homework assignments are 
much too difficult for the students of my district. I have LEP 
and PL students who wouldn't be able to handle the written 
work - especially the math. 

I'll use my district as an example. Most of my parents don't 
speak English. My department has no equipment or budget for 
the labs described and no films (have to order one year in 
advance - impossible for a first-year teacher). 

It is not clear how to interpret the criticism that the assessment included teaching 
techniques and conditions inappropriate for the students these teachers taught. The 
teachers may have realized that some context was provided in the instructions for the tasks 
and believed that more diverse contexts should be represented; they may have believed that 
their own teaching context had to be represented for the assessment to be valid for them; or 
they may have missed the contextual remarks in the introductory materials for the tasks. 

Teachers are licensed to teach all students, so it seems reasonable to present varied 
groups of students in the stimulus materials to test whether or not a teacher knows how to 
vary instruction. However, this skill is known to be more characteristic of experienced 
teachers than beginning teachers (Leinhardt, 1983), so beginning teachers may have 
difficulty in completing tasks for students with whom they have limited or no experience. 
One teacher who did feel that this assessment was appropriate for teachers of differing 
student groups summed up the dilemma of addressing teaching diverse students in an 
assessment: 



Teachers that have different ability levels and handicapped and 
limited English students should be tested for this. But [illegible] 
teachers will probably be exposed to these types of students and 
should be tested on their ability to handle the problems that 
could arise. But...I have not been trained to handle these 
students and have trouble finding answers considering the lack 
of resource specialists, resources, materials, textbooks, and the 
ratio of students to teachers. In my school it is 34:1 on 
average! Tough question! 

All four scorers felt that the assessment was "suitable for new teachers in 
different school and community groups." Opinions ranged from "Regardless of the school 
and community setting, all teachers need to be aware of the components in each and all of 
the tasks" to "There is some potential for [being unsuitable]. However, I can think of no 
way to prevent this. A broad-based series of questions should not penalize any one type of 
teacher too much." 

Another perspective on the appropriateness of the tasks for teachers of diverse 
students was obtained through review of portions of two tasks by Dr. Sharon Nelson-Barber 
of Stanford University, a consultant who works with school districts and teachers of 
classrooms composed primarily of students outside of the dominant culture. The materials 
sent included both stimulus materials and scoring criteria for one segment of Applying 
Effective Instructional Techniques and materials from both parts of Lesson Planning, 
including one of the lessons to be critiqued and the lesson to be designed. 

Dr. Nelson-Barber praised the provision for rater recognition of appropriate 
responses which are not included on the scoring guide. However, she emphasized the need 
for test developers to consider a variety of perspectives as the scoring criteria are devised. 
One example she cited was the literature on effective black teachers' emphasis on strong 
adult leadership (Hollins, 1982; Delpit, 1988; Foster, 1989; Ladson-Billings, 1989) as 
contrasted with more mainstream characterizations of good teaching as guiding and 
facilitating, i.e., deemphasizing the authority role. As an example of potentially conflicting 
notions of effective teaching, Dr. Nelson-Barber cited an instance of teacher 
sarcasm in the transcript analyzed that was evaluated as an inappropriate teaching 
behavior. However, certain culturally sanctioned teasing behaviors or "put downs" built 
upon shared backgrounds and cultural understandings between teachers and students have 
been used very effectively with black inner-city college students (Foster, 1989). It is likely 
that any similar teacher responses that lie outside the scorer's range of cultural experience 

3.36 



ERIC 



77 



and/or knowledge, but which may represent culturally appropriate and highly effective 
practice within the context of the teacher's particular teaching context will be either ignored 
or negatively evaluated. One way to reduce the likelihood of this occurring is to require 
review of an assessment by a number of successful teachers working in culturally diverse 
settings. 

Fairness across Groups of Teachers 

Teachers were asked whether or not they felt this assessment was "fair to new 
teachers of both genders, different ethnic groups, different language groups, and other 
groups of new teachers." Teachers overwhelmingly believed that the assessment was fair to 
different groups of teachers. For those teachers completing Form A, 91% (29 of 32) of the 
teachers agreed that it was fair; for teachers completing Form B, 85% (23 of 33) agreed. 

One teacher supported her affirmative answer by noting the diversity of students in 
the prompt materials. Another teacher who felt that the test was fair believed that "these 
factors should not be a concern." 

Teachers who disagreed gave differing reasons. Three teachers did not cite specific 
groups for whom they believed the assessment to be unfair, but instead expressed their 
disapproval of the entire assessment and their belief that no one should have to take it. 

Two other teachers expressed concern for teachers of varying English proficiency or 
cultural backgrounds: 

The language/culture differences of teachers are not addressed 
by the wording of the test. Someone who is not fluent in 
English may have difficulty with some terms. Also, some 
cultures may take a more regimented view of classroom 
management. 

Two other types of teachers elicited the concern of teachers: 

I am a bilingual science teacher which should be assessed along 
with other mainstream classes I also teach. 



Some teachers who were taught by the university system may 
have gotten more experience when it came to writing lesson 
plans than teachers who were taught by other alternative 
means. 

Finally, one teacher believed that the test was fair only if "teachers are allowed to 
choose the area to be tested in, for example, 'Cell Theory'." This comment mirrored the 
frustration of other teachers who described themselves as junior high school general science 
teachers faced with designing a laboratory in genetics. 

Three of the four scorers believed that the assessment was fair to all groups of 
teachers. As one scorer commented, "This assessment is directed to the skills needed to 
teach California public school science students. Teachers with various characteristics and 
teaching styles must be at least minimally proficient in the needed basic science teaching 
skills." The fourth scorer believed that "If the teacher does not have good command of 
English, this will be a problem." Whether or not "good command of English" was a 
necessary prerequisite for good science teaching was not addressed by this scorer. 

The expert on teaching diverse students, Dr. Nelson-Barber, stressed the need for 
specifying the information that the candidate is expected to provide for each task. For 
instance, many members of the black community, particularly working class blacks, use a 
communicative style that devalues the expression of "obvious" information (Heath, 1983; 
Taylor and Lee, 1987). In responding to an assessment, a teacher may not display the full 
range of their knowledge because aspects considered to be "obvious" are not mentioned. 
Epistemological or communicational patterns from other cultures may present additional 
problems. 

Appropriateness as a Method of Assessment 

Teachers were asked directly whether or not they thought "this type of assessment is 
an appropriate way of assessing your competency in teaching secondary life and/or general 
science." About 77% (9 of 13; this question was inadvertently omitted from the surveys of 
19 teachers) of those responding to the question for form A and 52% (17 of 33) of those 
completing form B believed the assessment to be appropriate. 

Teachers responding positively had comments such as the following: 



I think this is a good start - I think it assesses knowledge of 
teaching skills more than subject content. 

If the bugs are worked out and a realistic way to grade this 
mass of paperwork is discovered! 

My answer is yes and no because this could be one way of 
testing for competency. I feel classroom observations are 
important also. Actual teaching and a written assessment are 
two totally different things. 

More appropriate than the kind of assessment I received in my 
credentialing program. There was no check for competency 
other than knowledge of subject and pre-arranged classroom 
observations. 

Teachers who did not feel that the assessment was appropriate for measuring their 
competence offered specific criticism, perceiving a lack of measurement of collegial and 
interpersonal interaction variables, a need for more measurement of teaching culturally 
diverse students, a need for greater emphasis on cooperative learning, and general 
skepticism about the ability of pencil-and-paper tests to indicate teaching competency. The 
following is a sample: 

It's not at all close to what it's like to be a teacher. For 
example, when I send letters home to parents I always consult 
other teachers as I do when I plan a unit, etc. So much of 
being a good teacher depends on communicating with other 
teachers, with your personality around students, etc. 

Partially so, more emphasis should be placed on the assessment 
of culturally aware and sensitive teachers and their use of 
cooperative learning! No more book/lecture teaching!! 

There are so many variables in teaching. It is preposterous to 
think that a pen-paper test (in which I evaluate a script of an 
awful lesson or I develop a particular lesson for a particular 
subject for a particular grade for UNKNOWN students, sites, 
materials, staff, etc.) will in any way be an indicator of teaching 
competency. 

In motivating students - it is not only science, science, science, 
to build relationships, the teacher needs to be a little more 
personal. Part III-Safety - not all science teachers [go] into the 
stockroom with chemicals. There are separate stockrooms for 
life and physical [science]. 

Comparison with other assessments. Teachers were also asked the following 
question: "How does this assessment format (i.e., structured simulations) compare with 
others with which you have been evaluated (e.g., multiple-choice for CBEST and NTE 
Specialty Area Tests, classroom observation during student teaching) in terms of its 
assessment ability?" Roughly 50% (16 of 32) of those completing form A and 36% (12 of 33) 
of those completing form B gave answers indicating that they considered the 
secondary general/life science assessment better than the other assessments with 
which they have been assessed. Teachers particularly mentioned the CBEST and the NTE, 
the two multiple-choice assessments mentioned as examples. Sample comments are: 

More valuable than multiple choice because it allows for more 
complete communication. Probably as good as classroom 
observation because one does not feel so much "on the spot" and 
having to play to an audience. 

This assessment is much closer to real-life examples of 
teaching. All of these tasks are what practicing teachers need 
to perform during the course of their job. Multiple choice tests 
are limited in that they only test to see if you can recognize the 
appropriate response. 

Structured simulation is a great idea, it really does test things a 
teacher does on an everyday basis. CBEST assesses our 
'professional skills' to see if we have minimum basic educated 
knowledge. NTE tests our knowledge of the content area - 
'Biology' - but I have yet to be tested for my ability to be "a 
teacher." 



Nine percent (3 of the 32) of the teachers completing form A and 6% (2 of the 33) of 
those completing form B believed that this assessment method was inferior, compared with 
the others. The following comments illustrate specific criticisms: 

I feel the other tests (CBEST, NTE, etc.) were tests that could 
be scored fairly. These seem to be all gray areas. 

Believe NTE more fair and accurate. These assessments allow 
too many variables and ambiguities. Entirely too subjective. 

I feel the CBEST was a good indicator of basic skills, and I 
mean basic. I feel the NTE Specialty Test was a good indicator 
of Subject Area knowledge. I feel my student teaching 
evaluations were helpful in providing direction and in 
recognizing strengths and weaknesses. I feel that this test had 
nothing to do with the reality of teaching and would in no way 
be an indicator of my ability as a teacher. 

Twenty-two percent (7 of 32) of the teachers completing form A and 12% (4 of 33) of 
those completing form B did not offer an opinion of the relative merits of the Secondary 
Life/General Science Teacher Assessment and other teacher assessments. Instead, these 
comments indicated that the teachers believed that this assessment measured a different 
area than the specific assessments cited as examples: 

CBEST and NTE test knowledge of subject not teaching skills. 
Class observation is similar to this as the observer is looking at 
how the person teaches and knowledge of subject. This test 
looks more at teaching skills than the other assessments. 

CBEST and NTE are more comprehensive evaluations of 
subject area knowledge, this is a better evaluation of classroom 
management and knowledge. 

One thing I like about these tests was that they were assessing 
teaching ability, not necessarily content knowledge such as in 
the CBEST and NTE. Knowing a subject does not mean that 
one can teach it. 






Considering summary responses to both the general question about the 
appropriateness of the assessment and the explicit comparison with other assessment 
methods, most teachers approved of this assessment. Again, the teachers completing form B 
tended to be more critical than the teachers completing form A. 

Assessment Format 

Format Features 

This assessment format was a pencil-and-paper test with written stimuli which asked 
teachers to perform a series of tasks similar to those they encounter in their teaching. The 
tasks were developed through a "task shell" system where many different versions of a 
single task can be generated cost-effectively. 

Clarity of Preparatory Materials 

Prior to the assessment, teachers received information which gave them a limited 
idea of what they would be asked to do. Extensive preparatory materials were not 
developed for this pilot test. When teachers were contacted to solicit their participation in 
the assessment, they were told that the assessment consisted of approximately four hours of 
responding to structured simulation tasks with written prompts and written responses. The 
letter which they received confirming their participation described the assessment as 
consisting of "a set of structured simulation tasks depicting classroom management and 
instructional situations that a teacher of science might encounter." It also contained brief 
descriptions of six possible tasks (including one which was not pilot tested) and a list of 27 
possible science topics that might be included. 

Teachers were generally satisfied with the level of description provided in the 
preparatory materials (although this was probably significantly affected by the fact that 
their performance on the assessment had no consequences for them). As shown in Table 
3.5, teachers generally believed that the preparatory materials were clear with respect to the 
description of assessment activities and the aspects of teaching being measured. They did 
not believe that the scoring criteria, which were not addressed at all in the preparatory 
information offered, were described clearly. 

Teachers were also asked if there was any additional information that would have 
been helpful prior to the assessment. Eight teachers suggested that specific examples of the 
test items would be helpful, although some commented that it wasn't a problem (possibly 




TABLE 3.5 

TEACHER PERCEPTIONS OF THE PREPARATORY MATERIALS 
FOR THE SECONDARY LIFE/GENERAL SCIENCE TEACHER ASSESSMENT 

Number and % of Teachers Responding that Various Aspects of the 
Assessment Were Described Clearly in the Preparatory Materials 

                                        Teachers         Teachers 
                                        Completing       Completing 
                                        Task A           Task B 
                                        #     %          #     % 

Assessment Activities                   24    91%        29    88% 
Scoring Criteria                        15    47%        13    39% 
Aspects of Teaching Being Measured      26    81%        24    73% 

TOTAL N                                 32               33 






because the test had no consequences for them). Some teachers wished they had known 
other details such as whether the format of the test was written or oral and how and by 
whom it would be scored. 

Clarity of Task Instructions 

Because this assessment was in the developmental stage, the focus of the evaluation 
form was on identifying problems in the task instructions which could have affected teacher 
responses. Teachers were only asked to elaborate on their negative responses; few teachers 
elaborated on positive responses. 

The clarity of instructions for each task was evaluated both through teacher reports 
and from observation of the scoring process for many of the tasks. Teachers were asked if 
the directions for each task were clear and, if they were not, to describe the difficulty 
experienced. As can be seen in Table 3.6, a majority of the teachers perceived the directions 
to be clear for each task, ranging from a low of 66% for Parent/Student Letter to a high of 
88% for Classroom and Facility Safety. Each task will be discussed separately, combining 
teacher and scorer comments with FWL staff observations. 

Teachers described several difficulties in completing Applying Effective Instructional 
Techniques, most of which could be addressed through revised instructions. Two teachers 
could not tell whether they were supposed to respond to everything the teacher said, or 
confine the response to what they considered significant. Another could not tell whether or 
not to limit the analysis to student/teacher interactions or whether instructional content 
should be critiqued as well. One teacher suggested that illustrating the method of recording 
responses for this task by a labeled example would reduce confusion. 

Teachers sometimes made assumptions which were not warranted by the stimulus 
materials. Teachers were penalized for assumptions that contradicted the information 
provided. Responses that depended on assumptions that were consistent with the stimulus 
material but went well beyond the information provided were ignored. In some cases, 
scorers recommended modifying the script or providing additional contextual information to 
eliminate some possible interpretations of classroom events. 

One aspect of Applying Effective Instructional Techniques which emphasized its 
artificiality was the "Dr. Jekyll and Mr. Hyde" nature of the teacher whose behavior swung 
back and forth from being exemplary to extremely inappropriate. Some of the inappropriate 






TABLE 3.6 

TEACHER PERCEPTIONS OF THE CLARITY OF TASK INSTRUCTIONS 
FOR THE SECONDARY LIFE/GENERAL SCIENCE TEACHER ASSESSMENT 

Number and % of Teachers Indicating 
Directions to the Task Were Clear 

                                        Teachers         Teachers 
                                        Completing       Completing 
                                        Task A           Task B 
Task                                    #     %          #     % 

Applying Effective 
  Instructional Techniques              24    75%        23    70% 
Teacher as Curriculum 
  Decision-Maker                        23    72% 
Parent/Student Letter                   21    66% 
Lesson Planning                                          23    70% 
Classroom and Facility Safety                            29    88% 

TOTAL N                                 32               33 






actions were also very obvious, such as when the teacher reprimands one student for 
applying makeup during class and ignores another. 

For Teacher as Curriculum Decision-Maker, teachers reported difficulty in shuffling 
the many pieces of paper and figuring out how to complete the table in the answer sheet. 
The answer sheet was designed for ease in scoring, where teachers recorded code numbers 
for each activity. This made it difficult, however, to keep track of the activities already 
recorded without continually referring back to the papers with descriptions of activities. 

For two administrations of this task, teachers were given scissors to cut apart the 
activities and physically reassemble them into the unit. Several teachers who experienced 
this version remarked on the amount of time it took to cut out the activities. The test 
developer had considered, and discarded, the idea of using index cards, but it is likely that 
this would solve some of the logistical problems that the teachers experienced. 

One teacher suggested that Part II, the portion where the teacher provides a 
rationale for the activities selected, and Part I be completed simultaneously, as it was 
difficult to reconstruct the rationale after the fact. Another teacher did not understand the 
format in which responses were expected. 

Both teachers and scorers reported that some of the estimated times for completing a 
laboratory were much too brief. The two scorers believed, in addition, that the descriptions 
of some of the lectures and films needed to be elaborated in order for the teachers to 
appropriately evaluate them. 

The task which received the lowest percentage of teachers agreeing that the 
directions were clear was the Parent/Student Letter. Although one example was provided 
for each part, teachers reported being unclear on what was expected, requesting more 
examples. The scorers agreed that greater clarity in the directions as to the distinction 
between the two parts of the letter would have been beneficial, especially as it appeared that 
many teachers had never seen such a letter before. 

Teachers also reported being unclear on whether the responses were to be in a list 
form or written out as they would appear in a letter. 

Despite a list of the elements in the lesson to be covered, some teachers found the 
portion of the Lesson Planning task where they were to write their own lesson confusing. 
Some teachers wanted additional information, such as the length of the class period. 






Another teacher wanted to know what kind of students composed the class; either they did 
not notice the brief description of the classroom or they wanted additional information. 

Classroom and Facility Safety had the highest percentage of teachers reporting that 
the directions were clear; no teacher described any difficulties in completing this task. 

Length of Tasks 

Teachers were asked if they had sufficient time to complete each task, and to identify 
any task for which they needed more time. For Form A, only 28% (9 of 32) of the teachers 
reported sufficient time to complete all tasks. Twenty-four teachers suggested more time for 
Teacher as Curriculum Decision-Maker, six identified Applying Effective Instructional 
Techniques, and four singled out Parent/Student Letter. Two of these teachers reported 
needing more time to complete all the tasks. Estimates of the amount of additional time 
needed ranged from fifteen to thirty minutes. One dissenting teacher believed that the time 
limits should be reduced to forty-five minutes for each exercise. 

Teachers completing Form B were more satisfied with the time allotted. Eighty-eight 
percent (29 of 33) of the teachers reported no difficulties in completing the tasks 
within the time limits provided. Four teachers identified Applying Effective Instructional 
Techniques as needing more time, and one chose Lesson Planning. 

Clarity of Scoring Criteria and Procedures 

Scorers were asked if they had any difficulties in applying the scoring criteria for any 
of the tasks. Only one specific problem was reported: "The biggest problem was knowing 
how much to 'read in' to answers (e.g., is 'denigrating primary language' the same as 'racial 
bias')." Two scorers noted that since the scoring guides were previously untried, many 
revisions were needed. To FWL staff who observed the scoring process, it seemed fairly 
straightforward to match teacher responses to specific scoring criteria; in some cases, it was 
difficult to apply the criteria which distinguished between the responses awarded one point 
and those awarded two points. 

When scorers were asked if some tasks were harder to rate than others, only Teacher 
as Curriculum Decision-Maker and Lesson Planning were identified. However, each of these 
tasks was mentioned by both of the two scorers who graded them. Teacher as Curriculum 
Decision-Maker was described as "slow" or "tedious" to score. One part of that task, 
specifically referred to by one scorer, involved the application of a lengthy scoring algorithm 




which checked for the presence or absence of numerous activities or patterns of activities in 
the unit plan. If the assessment were to be operationalized, that portion of the assessment 
would be keyed into a computer and scored through a scoring program. Lesson Planning 
was described as difficult to score because "candidates write poorly and have minimal skill 
in writing lesson plans." 

Cost Analysis 

Administration and Scoring Cost Estimate 

The Secondary Life/General Science Teacher Assessment tasks are administered in a 
large group setting. Thus, the tasks can be administered by one or more persons with little 
or no training in the specific content of the assessment using procedures common to 
standardized group test administrations. 

The largest component of the cost of this assessment is that of personnel. Scoring 
requires the training of raters knowledgeable in the content and criteria for the assessment. 
Scoring of the pilot test data, which included both training and actual scoring, required four 
days for two scorers for form A and roughly seven days for two scorers for form B. (For 
Form B, the scorers, who had also been part of the assessment development team, made 
extensive revisions in the stimulus materials and scoring criteria for some of the tasks. The 
tasks represented in Form B also had more subparts than those in form A.) We estimate 
this time as minimal to insufficient for training and scoring an assessment such as this. 
With more fully developed scoring criteria which can be extended to other tasks within the 
same task shell and more fully developed stimulus materials, it is likely that the system 
could be implemented on a wide scale basis. We will use the time and costs associated with 
scoring the pilot tests as the current best estimate for administering similar assessments. 

The pilot test involved training four scorers and scoring 22-23 teacher responses (the 
remaining ten were used in training) to each of six tasks over a period of four to seven days. 
Training and scoring were conducted separately for each task. The amount of time required 
to score each task was more closely related to the number of its subparts than the length of 
time required by the teachers for its completion, but training, scoring, and some 
development work averaged 1 1/3 to 2 1/3 days per task, depending on the form. Based on 
this experience, we estimate that approximately two days per scorer would be required to 
train and score roughly 20 teacher responses to a single task. If a half day assessment 
consisted of three tasks, it would take approximately six scorer-days to score twenty teacher 
assessments. According to this logic, five scorers should be able to score 100 teacher 




assessments resembling either Form A or Form B in six days, with periodic checks to ensure 
that scorers are applying scoring criteria correctly. Assuming a cost of $160 per day for 
each scorer, this implies a cost of approximately $48 per teacher (30 scorer-days at $160, 
or $4,800, spread over 100 teachers) to train scorers and score 
an assessment. If these same scorers were used again for a similar task shell, the training 
time might be shortened, reducing marginally the total scoring costs. 

Costs for test administration, duplication of materials, postage, travel, etc. would also 
need to be added to the costs for scoring the assessments. As we have outlined on other 
assessments, a cost of $30 per assessment for these activities assumes minimal travel costs 
for test administrators. A summary of cost estimates for administering and scoring an 
assessment like this includes: 

Training and Scoring:              $48 per assessment 

Administration/Other:              $30 per assessment 

Total Administration and 
Scoring Costs:                     $78 per assessment 
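
The arithmetic behind these estimates can be sketched in a few lines of Python. This is only an illustration: the figures are those stated in the text (about two scorer-days per task for roughly 20 responses, three tasks per assessment, $160 per scorer-day, $30 for administration), while the function and parameter names are hypothetical and not part of the original study.

```python
# Illustrative sketch of the scoring-cost estimate; names are hypothetical,
# figures are those stated in the report.

def scorer_days(teachers, tasks=3, days_per_task=2, batch_size=20):
    """Scorer-days needed, assuming about two days per scorer to train on
    and score one task for a batch of roughly 20 teacher responses."""
    return days_per_task * tasks * teachers / batch_size


def cost_per_teacher(teachers, daily_rate=160, administration=30):
    """Return (scoring cost, total cost) per teacher in dollars."""
    scoring = scorer_days(teachers) * daily_rate / teachers
    return scoring, scoring + administration


days = scorer_days(100)            # 30 scorer-days, e.g. five scorers for six days
scoring, total = cost_per_teacher(100)
print(days, scoring, total)        # 30.0 48.0 78.0
```

The sketch scales linearly with the number of teachers, as the estimate in the text does; it does not model any savings from reusing trained scorers.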

Development and Pilot Testing Costs 

The costs for developing the five tasks for this assessment were $130,157 and are 
broken out by cost category in Table 3.7, which also includes costs for pilot testing. These 
development costs are the expenses for the assessment developer to deliver prototype 
activities to the CTC and SDE. In addition, $45,211 was spent for the pilot testing of these 
tasks with 65 teachers. 

These data provide a rough indication of the magnitude of costs that would be 
incurred if a similar assessment were to be adapted for implementation. 

Technical Quality 

This section describes the process by which the assessment was developed, and 
discusses the reliability and validity of the assessment based on analyses of teacher 
performance, and refers to other analyses which pertain to evidence of validity. 






TABLE 3.7 

DEVELOPMENTAL AND PILOT TEST COSTS FOR THE 
SECONDARY LIFE/GENERAL SCIENCE TEACHER ASSESSMENT 

                                        Development      Pilot Test 

Staff-Salaries & Benefits                $54,202          $16,014 
Consultants (Teachers, assessors, 
  and other consultants)                       0            2,853 
Travel (Consultants and staff)             9,280            9,142 
Other Direct Costs (Rand fee, 
  site rental, phone, duplication)        27,650            7,731 

Total Direct Costs                       $91,132          $35,740 

Indirect Costs                            39,025            9,471 

Total Costs                             $130,157          $45,211 






Development 



The shell/task development process is iterative in nature. The process begins with 
brainstorming about the general features of a shell, but to facilitate mutual understanding 
of concepts, developers are encouraged to illustrate their ideas with concrete examples from 
their own teaching experiences. This requires relating broad generalizations about good and 
bad teaching practices to specific examples of them. 

These discussions provide a bridge between general concepts about good teaching 
practice (the craft) and concrete teacher behavior. This bridge helps the team flesh out the 
essential elements of a particular task. The fleshing out process also identifies factors that 
need to be included in the shell, e.g., the generic types of stimuli to which the candidates 
should respond and which responses are more or less appropriate. The team's discussions 
therefore shift back and forth between a focus on the general features of the shell and the 
specific elements of a task that would simulate those features in a realistic way. Sometimes 
a task is developed before its shell because only through the task construction process can 
the elements be identified that need to go into the shell. Usually about 4 to 5 teachers (and 
teacher educators) participated in the task development process. 

The assessment developers intended to pilot test the materials with at least six 
prospective or new teachers before the materials were released for larger scale pilot testing. 
However, in a few instances, due to circumstances beyond their control, the initial 
shakedowns failed to take place. These shakedowns are viewed as an integral part of the 
development process, which follows a cyclical model of develop, test, revise, test again until 
the prototype task is considered complete. 

Reliability 

The following analyses were performed on the pilot test data of 32 teachers for Form 
A and 33 teachers for Form B. Interrater agreements could not be computed for reasons 
explained below. Internal consistency estimates were generated to assess the degree to 
which the variables or factors within each of the tasks would form a measure and the degree 
to which the different activities related to each other and might form an overall assessment 
of a candidate. 

Interrater agreements. The process followed in scoring was that the scorers 
conferred on instances where the scores differed by two or more points. Scorers then 
changed their original scores. Therefore, interrater reliability estimates were not calculated 




for the pilot test data since the ratings were not independent and reflected a consensus 
between scorers. 

Internal consistency of the tasks and assessment. Coefficient Alpha reliability 
estimates were calculated for the tasks by using the individual ratings on subparts within 
each task. The reliabilities for the tasks and subparts are shown in Table 3.8. The 
reliability estimates for the tasks ranged from -.11 for Teacher as Curriculum Decision-Maker 
on Form A to .62 for Lesson Planning on Form B. These reflect a relatively low 
degree of internal consistency within the tasks. These results should be interpreted in light 
of the early and formative development of these measures. For example, the pilot test and 
scoring were used to further refine the stimulus materials and scoring criteria. A more 
positive interpretation of the low internal consistency is that the different subparts measure 
more independent factors of a teacher's performance. The lowest reliability estimate, for 
Teacher as Curriculum Decision-Maker, might be explained by the fact that 
many teachers did not complete Part II, and 75% (24 of 32) of the teachers completing 
Form A identified this task as one needing more time for completion. 
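
For readers unfamiliar with the statistic, the following is a minimal sketch of how a coefficient alpha estimate is computed from subpart ratings, using the standard formula alpha = k/(k-1) x (1 - sum of subpart variances / variance of total scores). The ratings shown are hypothetical and are not pilot test data; the function name is illustrative, and the report does not describe the software actually used.

```python
from statistics import variance


def coefficient_alpha(ratings):
    """Coefficient alpha for one task.

    ratings: one row per teacher, one column per subpart score.
    """
    k = len(ratings[0])
    if k < 2 or len(ratings) < 2:
        raise ValueError("need at least two subparts and two teachers")
    # Variance of each subpart across teachers.
    item_variances = [variance([row[j] for row in ratings]) for j in range(k)]
    # Variance of each teacher's total task score.
    total_variance = variance([sum(row) for row in ratings])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings (not pilot test data): three teachers, two subparts.
consistent = [[0, 0], [1, 1], [2, 2]]    # subparts rank teachers identically
conflicting = [[0, 2], [1, 1], [2, 1]]   # subparts tend to disagree

print(coefficient_alpha(consistent))     # 1.0
print(coefficient_alpha(conflicting))    # a negative value
```

The second case shows how alpha can fall below zero when subparts covary negatively, which is one way a value such as -.11 can arise.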

In judging the "goodness" of these data in light of the developmental status of the 
instrument, it is helpful to reflect that the developer built this prototype using experience 
and models used with other licensing examinations, particularly state bar examinations. 
The developer states, "On the surface, one would think that the scores on two tasks created 
from the same shell would correlate more highly with each other than would either of them 
correlate with the scores on tasks created from other shells. That may happen, but I doubt 
that the differences would be very large. The unique features of a task, such as grade level 
and subject matter for the unit, may be more familiar to some candidates than to others and 
thereby influence scores. For this reason and others, no one task, by itself, is likely to be 
very reliable. And, the correlation between tasks - whether from the same or different 
shells - will not be especially high (expect low .20's). Whether such a pattern of 
correlations is considered good or bad depends on the goals for the test. If all tasks 
correlate with each other to about the same degree regardless of whether or not they were 
created from the same shell, then this would undermine the position of those who want to 
use the test results for diagnostic and educational purposes, such as providing candidates 
with subscores for such things as 'lesson planning' or 'classroom management.' If, on the 
other hand, the purpose is to make a defensible pass/fail decision based on a general 
measure of teacher proficiency, then this pattern of intercorrelations is fine provided that as 
a group, the tasks simulate a wide range of important tasks that teachers should be able to 
perform and span the types of school contexts and subject matter areas to which the license 
applies." 




TABLE 3.8 

INTERNAL CONSISTENCY OF TASKS 
SECONDARY LIFE/GENERAL SCIENCE TEACHER ASSESSMENT 

Task                                              Task Reliability 
                                                  (Coefficient Alpha) 

Form A: 
  Applying Effective Instructional Techniques          .50 
  Teacher as Curriculum Decision-Maker                -.11 
  Parent/Student Letter                                .23 

Form B: 
  Applying Effective Instructional Techniques          .55 
  Lesson Planning                                      .62 
  Classroom/Facility Safety                            .30 






Intercorrelations among tasks. Correlations among the three tasks of each Form 
were calculated for the 32 teachers completing Form A and 33 teachers completing Form B, 
and are reported in Table 3.9. Only the correlation between Teacher as Curriculum 
Decision-Maker and Parent/Student Letter was statistically significant. This pattern is 
again consistent with what the developer had predicted. If this pattern were to persist with 
further development and refinement of the assessment, it would imply that an overall 
decision using information across tasks would be based on multiple, relatively independent 
factors rather than an overall composite measure of a teacher's ability. As stated earlier, 
either pattern, i.e., multiple or single factors, is acceptable but the type of information and 
its use should be interpreted in light of the pattern(s). 
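
As an illustration of why only one Form A correlation reaches significance, the sketch below converts each reported correlation to a t statistic, t = r x sqrt(n - 2) / sqrt(1 - r^2), and compares it with the two-tailed .05 critical value for 30 degrees of freedom (2.042, a standard table value). The report does not state which test was used, so the two-tailed t-test is an assumption, and the names in the code are illustrative.

```python
import math

# Two-tailed critical t for alpha = .05 with 30 degrees of freedom
# (standard t-table value); assumes the 32 Form A teachers, so df = 32 - 2.
T_CRITICAL_DF30 = 2.042


def t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Form A correlations as reported in Table 3.9.
form_a = {
    ("Applying Effective Instructional Techniques",
     "Teacher as Curriculum Decision-Maker"): -0.04,
    ("Applying Effective Instructional Techniques",
     "Parent/Student Letter"): 0.31,
    ("Teacher as Curriculum Decision-Maker",
     "Parent/Student Letter"): 0.39,
}

# Assumption: a two-tailed test, which the report does not specify.
significant = {pair: abs(t_statistic(r, 32)) > T_CRITICAL_DF30
               for pair, r in form_a.items()}
for pair, flag in significant.items():
    print(pair, flag)
```

Under these assumptions only the .39 correlation exceeds the critical value, which agrees with the single asterisk in Table 3.9.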

Validity of Agreement Through Group Comparisons 

Teachers participating in the pilot test differed in ethnicity, gender, 
teaching experience, etc. Examining differences among these groups might provide some tentative 
information about the validity of the assessment. For example, positive evidence would 
include minimal differences among ethnic or gender groups, together with differences among 
teachers with more or less experience and preparation that support the assessment's 
sensitivity and ability to measure any additional knowledge the training and experience 
might provide. Although differences between groups would be difficult to detect given the 
relatively low reliabilities associated with the current assessment, it may still be worthwhile 
to examine the differences for any patterns. Table 3.10 contains a summary of the trends 
for the pilot sample of 65 teacher candidates. Appendix A provides the means, standard 
deviations and numbers of candidates from which these summaries were constructed. A 
plus (+) indicates that the mean or average for the first group was greater than that for the 
second group. For example, the pluses under the Female-Male column indicate that for 3 of 
6 tasks, the average female score was greater than that of the males. No notable differences 
were detectable on any of the variables: the direction of the mean differences was evenly 
split (3-3) or nearly evenly split (2-4) across the six tasks. Whether the lack of differences 
is due to the characteristics and status of 
the assessment or due to the absence of differences among the groups is unknown at this 
point. 

With further development, it would be desirable to observe patterns such that 
teachers with more training and experience outperform those with less and that scores of 
teachers of different gender, ethnicity, or teaching location are not notably different. 

Content validity. Evidence of the content validity of this assessment comes from 
three sources. The first is the role that teachers and science educators have had in its 




TABLE 3.9 

INTERCORRELATIONS AMONG TASKS 
SECONDARY LIFE/GENERAL SCIENCE TEACHER ASSESSMENT 

                                                     1        2 

Form A: 
  1. Applying Effective Instructional Techniques     -- 
  2. Teacher as Curriculum Decision-Maker           -.04      -- 
  3. Parent/Student Letter                           .31      .39* 

Form B: 
  1. Applying Effective Instructional Techniques     -- 
  2. Lesson Planning                                 .11      -- 
  3. Classroom/Facility Safety                       .11      .09 

*p<.05 






TABLE 3.10 

TRENDS OF MEAN DIFFERENCES IN TASK PERFORMANCE BETWEEN 
CANDIDATES WITH DIFFERENT CHARACTERISTICS* 

SECONDARY LIFE/GENERAL SCIENCE TEACHER ASSESSMENT 

Columns, left to right: Activity; Gender (Female/Male); Teacher Preparation 
(Regular/[illegible]); Teaching Level (HS/Middle or Jr. High); Teaching Location 
(Inner-City/[illegible]); Ethnicity (Non-Minority/Minority) 



Form A 












Applying Effective 

instructional 

Techniques 




+ 








Teacher as 
Curriculum 
Decision-Maker 


+ 






+ 




Parent/Student 
Letter 


+ 






+ 


. i 


Form B 












Applying Effective 

Instructional 

Techniques 


+ 




+ 




+ 


Lesson Planning 




tie 








Classroom and 
Facility Safety 




+ 


+ 






SUMMARY        3/6        3/6        3/6        2/6        2/6 



*Entries reflect the direction of the mean differences for the different candidates. For 
example, for Applying Effective Instructional Techniques, Form A, the mean score 
of male teachers in the pilot test was greater than that of the females. The individual 
differences for each task or activity do not generally represent statistically significant 
differences. 






development. The second is the analyses of the match of the assessment to the model 
curriculum guide and California Beginning Teacher Standards that compares the 
assessment's content with that recommended in the official documents. The third is the 
type of concerns raised by the beginning teachers who participated in the pilot test. These 
analyses have been described earlier and implications for further development are described 
in the following section. 

Conclusions and Recommendations 

This section contains conclusions and recommendations regarding the Secondary 
Life/General Science Teacher Assessment, organized into the areas of administration, 
scoring, content, format, and a brief summary. 

Administration of Assessment 

Like other large-scale examinations, the Secondary Life/General Science Teacher 
Assessment is administered simultaneously to a large number of people. Because it can draw on 
many years' experience in conducting such examinations, the administration of the actual 
assessment poses few logistical problems. The only difference between this assessment and 
traditional large-scale tests is the requirement of additional surface space to accommodate 
the materials for each task. Although no trouble was experienced in locating facilities for 
this small pilot test, the additional space requirement may preclude the use of large lecture 
rooms or auditoriums equipped with small, easily-stored writing surfaces. 

Our experience in locating secondary life/general science teachers to participate in 
the assessment leads us to conclude that such teachers do not tend to be concentrated in 
concise geographic areas, even within large metropolitan areas. The administrative 
requirements of the assessment make it possible to administer it centrally to large 
groups, thus considerably reducing the administrative costs per teacher. However, it should 
be noted that the higher degree of centralization afforded by this assessment may place 
larger burdens on teachers from rural areas and the outer edge of metropolitan areas who 
would have to travel a long distance to a selected site. 

The development of a "shell" for each task permits the teaching context and science 
content to be varied while ensuring the comparability of tasks across time. Tasks where the 
correct answers are relatively independent of the content and context portrayed would pose 
security risks over time due to their use of formulaic answers which could be easily 






memorized. Tasks whose answers are highly dependent on the content and context 
portrayed are more suitable for variation over time. 

Scoring 

Scoring consists of checking teacher responses against a predetermined list of 
possible correct responses. Scorers judge responses which are not on the list according to 
their professional judgement, and are free to award credit to responses judged to be 
acceptable which are not on the original list. For the most part, this methodology worked 
well, although major revisions in either the scoring criteria or stimulus materials were 
needed for some of the tasks. This suggests the need for more extensive pilot testing of 
tasks prior to their administration as an assessment. 

Scoring training consisted of an orientation to the scoring guide, independent scoring 
of a sample response, and a group discussion of the resulting scores. Approximately ten 
sample responses were scored for each task subpart. This represents a departure from the 
assessment developer's recommended practice, where approximately fifty sample responses 
are used in the training. 

FWL has the following recommendations for revisions in the scoring training: 

• Provide more examples of scored responses, especially for those task 
subparts where partial credit is given for incomplete responses. 

• Although the scorers believed that minimal knowledge of science and science 
teaching is necessary to score the tasks, scorers should continue to be recruited 
from experienced science teachers until data is available to consider the effects of 
the use of scorers with lesser qualifications. 

If the ultimate set of tasks which constitute the assessment represent a sufficiently 
broad sample of tasks that are critical to teaching success, the Secondary Life/General 
Science Teacher Assessment should be sufficient for purposes of licensure. However, since 
there is little information on specific teaching competencies either within or across tasks, 
this assessment is less useful for yielding diagnostic information for staff development or 
beginning teacher support. 






Assessment Content 



Our observations and information collected from scorers and teachers participating 
in the pilot test suggest the following conclusions about content: 

• Modifications of the tasks are necessary to bring the assessment into closer 
congruence with the latest Science Framework, especially with respect to the lack 
of variety in the science content portrayed, no representation of thematic 
structuring of science content, and the lack of representation of the middle school 
curriculum. 

• Coverage of the California Standards for Beginning Teachers varies. Every 
standard is addressed to some extent. The standards addressing curricular and 
instructional planning skills, student diagnosis, achievement and evaluation, and 
cognitive outcomes of teaching are most completely addressed. Standards 
receiving limited attention include student motivation, involvement and conduct, 
affective outcomes of teaching, the capacity to teach crossculturally, and 
professional obligations. 

• Most of the teachers and all of the scorers believed that the tasks were relevant 
to the job of secondary science teachers. 

• Most of the teachers agreed that they had been sufficiently prepared to respond 
reasonably to the tasks. However, over one-third of the teachers reported 
difficulty with one or more of the tasks. The largest number of teachers 
identifying specific tasks as difficult cited the two tasks related to instruction, 
Lesson Planning and Teacher as Curriculum Decision-Maker. Scorers identified 
Lesson Planning, Classroom and Facility Safety, and the Parent/Student Letter as 
being particularly difficult for the teachers participating in the pilot test. 

• Teachers held mixed opinions as to whether or not the assessment was 
appropriate for teachers of different grade levels. Concern was expressed for 
middle school or junior high school teachers, since the curriculum at that level 
was not represented in the assessment. 

• Between half and two-thirds of the teachers believed that the assessment was 
appropriate for teachers of diverse student groups. Teachers' criticism of the 
assessment reflected discomfort with being asked to design instruction for 
students they had never taught or to evaluate teaching methods which 
they did not use. 

• An expert on teaching diverse students cautioned that unless culturally diverse 
models of appropriate teaching are built into the scoring criteria, scorers may be 
unable to recognize culturally appropriate responses that lie outside their range 
of cultural experience and/or knowledge. Teachers from particular cultural 
communities who are teaching effectively in that community may be penalized as 
a result. 

• Teachers and scorers overwhelmingly believed that the assessment was fair to 
different groups of teachers. The expert on teaching diverse students, however, 
cautioned that unless every effort is made to reduce possible instances of 
miscommunication, teachers who are not from the dominant culture may not 
display the full extent of their knowledge about teaching. 

• Between one-half and three-quarters of the teachers believed that this assessment 
method is appropriate for measuring their skills as secondary science teachers. 

Assessment Format 

The assessment format is a pencil-and-paper test with written stimuli which asks 
teachers to perform a series of tasks similar to those encountered in teaching; responses are 
compared to a predetermined set of correct responses. 

Based on evaluations by teachers, scorers, and FWL staff, the following modifications 
in the tasks are needed: 

• Directions for Applying Effective Instructional Techniques should be revised to 
more clearly indicate the type and form of responses teachers should make. The 
teacher's behavior in the script should be more plausible, i.e., not shift from 
exemplary to extremely inappropriate. Roughly 15% of the teachers identified 
this task as needing more time for completion than the hour provided. 

• For Teacher as Curriculum Decision-Maker, teachers should be given the 
activities printed on index cards to facilitate sorting. Descriptions of activities 
and estimated times for their completion should be double-checked; descriptions 
should contain enough information to enable the teacher to understand their 
content and possible strengths and weaknesses. Three-fourths of the teachers 
reported needing time beyond the ninety minutes provided to complete this task. 

• Directions and examples which make a clearer distinction between the 
responses expected in the two parts of the Parent/Student Letter are needed. 

• Although some teachers believed that additional information was needed in order 
to complete Lesson Planning, almost all of the additional information cited was 
already provided. Some experimentation with the format in which it is presented 
may be needed in order to assist teachers in locating it. 

• No teachers reported difficulty in completing the Classroom and Facility Safety 
task. 

Summary 

The methodology used in the Secondary Life/General Science Teacher Assessment 
has been successfully implemented in the application portion of examinations for licensure 
of lawyers. While minor revisions are still needed to obtain a fully developed prototype, 
pilot test results suggest that it could be successfully replicated in teaching. However, the 
nature of its scoring system, while suitable for licensure decisions, is less suited to yielding 
diagnostic information to inform staff development and/or beginning teacher support. 






CHAPTER 4: 
SCIENCE LABORATORY ASSESSMENT 






The Science Laboratory Assessment is an observation system developed by RMC 
Research Corporation in Mountain View, California. As its name suggests, the context of 
the assessment is a science laboratory activity. This activity may be conducted either in or 
outside the classroom (e.g., a science field trip), but it must be student-centered, hands-on, 
and inquiry-oriented. The major portion of the assessment consists of an observer using the 
Science Laboratory Assessment instrument to conduct a 30-45 minute (minimum) 
observation of the science laboratory activity, focusing on seven domains of teacher 
performance. 

The seven domains of teacher performance are deliberately broad in scope to 
represent aspects of teaching at all grade levels, in all subject areas, and in a variety of 
settings. The seven domains are: Pedagogy, Content, Materials/Equipment, Management, 
Knowledge of Students, Climate, and Communication. 

For each of the seven domains, there are between two and nine elements which 
help define the domain being observed. An example of one domain and its four defining 
elements is as follows: 

Domain: Materials/Equipment 

Elements: Teacher Use, Safe Setup, Safe Practices, 
Availability 

Each element is further defined by indicators which describe the type of teacher 
performance to be observed and examples of behaviors or events that provide evidence for 
use in the assessment. A complete list of the assessment's domains and elements can be 
found in Figure 4.1, page 4.2. Figure 4.2 gives an example of the defining indicators for 
three elements of the Materials/Equipment domain. (See Appendix B for a complete 
description of the domains, elements, and indicators, as well as the materials used for the 
documentation and analysis process.) 






FIGURE 4.1 
LIST OF DOMAINS AND ELEMENTS: 
SCIENCE LABORATORY ASSESSMENT 

A. PEDAGOGY 
A1. Planning 
A2. Sequence 
A3. Prelab 
A4. Directions 
A5. Explanation/Presentation 
A6. Monitoring/Adjusting 
A7. Feedback 
A8. Questioning 
A9. Closure 

B. CONTENT 
B1. Accurate 
B2. Integrated 
B3. Related to Objectives 

C. MATERIALS/EQUIPMENT 
C1. Teacher Use 
C2. Safe Setup 
C3. Safe Practices 
C4. Availability 

D. MANAGEMENT 
D1. Grouping 
D2. Other Personnel 
D3. Routines and Transitions 
D4. Student Engagement 
D5. Timing 
D6. Student Behavior 
D7. Lab Cleanup 

E. KNOWLEDGE OF STUDENTS 
E1. Diversity 
E2. Student Characteristics 

F. CLIMATE 
F1. Interactions with Students 
F2. Interactions among Students 
F3. Attitudes 
F4. Inquiry 

G. COMMUNICATION 
G1. Speaking 
G2. Writing 
G3. Listening 
G4. Strength of Presence 






FIGURE 4.2 

THREE ELEMENTS AND DEFINING INDICATORS OF THE 
MATERIALS/EQUIPMENT DOMAIN: 
SCIENCE LABORATORY ASSESSMENT 

C1. TEACHER USE 

The teacher properly uses the equipment and handles 
the materials employed in the observed laboratory 
activity. Live organisms are maintained and handled in 
a humane and appropriate manner. Where applicable, 
the teacher is alert to student allergies, fears, and 
other problems related to the use of specimens or live 
organisms in the science lab activity. 

C2. SAFE SETUP 

The setup of equipment, furniture, and materials has 
no serious irregularities or dangerous conditions. The 
setting has, as needed, adequate ventilation, first aid 
supplies, safety equipment, corrosive-resistant 
counter tops, a fire extinguisher, running water, good 
lighting, etc. Materials and equipment are stored, 
labeled, and moved properly. 

C3. SAFE PRACTICES 

The teacher knows about the potential dangers 
involved in the planned science laboratory activity. 
The teacher informs students about, checks for 
understanding of, and enforces the proper use of 
equipment and handling of materials, as needed. The 
teacher tells students about safety procedures, 
potential dangers and actions to take, and proper 
cleanup and disposal procedures. Students are 
wearing safety gear (e.g., goggles, aprons, gloves) 
when needed. Cleanup and disposal are completed in 
a well-coordinated and safe manner. The teacher 
is alert to potential safety problems, knows what 
to do if a safety problem occurs, and takes corrective 
measures when necessary. There are no observed 
teacher violations of state and federal safety laws 
and regulations on the setup, use, and handling 
of materials and equipment. 






Although the majority of evidence corresponding to the domains, elements and 
indicators comes from the actual observation, the Science Laboratory Assessment also 
provides for evidence to be collected by the observer from three other assessment 
components: (a) a Pre-Observation Questionnaire completed by the teacher, (b) a 20-30 
minute Pre-Observation Conference with the teacher, and (c) a 15-20 minute Post- 
Observation Conference with the teacher. 

As is often the case with high-inference observation instruments, a key feature of 
the Science Laboratory Assessment is its documentation and analysis process. This process 
entails extensive scripting during the observation and then a rewriting of the data in a 
specific manner on another form. These two steps must be done before the observer gives 
any ratings of the teacher's performance. 

A distinctive feature of the Science Laboratory Assessment is the part of its 
documentation process called guided note-taking. Instead of requiring the observer to script 
the entire lesson as accurately as possible in a chronological manner (as is done with some 
high-inference observation instruments), the guided note-taking process requires the 
observer to categorize the evidence and notes from the lesson at the same time as it is 
scripted. That is, the observer categorizes the information from the observation by domain 
simultaneously with recording it. To facilitate this procedure, the observer scripts all 
evidence and notes on a specially-designed Guided Note-taking Form (GNF) which is 
divided into seven spaces corresponding to the assessment's seven domains (see Appendix ). 
Typically, an observer may use 12-15 of these forms to record data from a single 
observation. 

Other distinctive features of the Science Laboratory Assessment are its 
Documentation Sorting Record and a Summary Report Form. Upon completion of the 
observation, the observer takes the information from the Guided Note-taking Forms and 
further categorizes (or sorts) the data by element on a seven-page Documentation Sorting 
Record (DSR). Data from the pre- and post-observation conferences and the questionnaire 
is also sorted by element on the DSR. 

After reviewing all the information on the DSR, the observer uses a Summary Report 
Form to rate the teacher's performance on each domain and to enter an overall rating. 
Using a two-point rating system, the observer is instructed by the Assessor's Handbook to 
give a rating of "2" if "s/he feels the teacher's performance is minimally acceptable or better," 
and a rating of "1" if the performance is not acceptable. For each rating, the observer is also 
asked to provide three or four corresponding summary remarks. 

For this pilot test, a single observation per teacher was deemed sufficient for trying 
out the assessment instrument. If this assessment were to be used as the primary data for 
credentialing purposes, the developers of the assessment suggest that a minimum of four 
observations should be conducted for each new teacher, with the new teacher conducting a 
different type of lab activity (e.g., exploratory, illustrative) in different content areas each 
time. 

The administration of the Science Laboratory Assessment in this pilot test, the 
content of the instrument, and the assessment format are discussed below. The content and 
format sections of the report contain information from the teacher and assessor evaluation 
forms, as well as information and analysis of scoring results. Following these are 
discussions on cost analysis and technical quality of the prototype assessment. The chapter 
concludes with an overall summary with recommendations for further steps in exploring the 
feasibility and utility of high-inference, subject-specific observation instruments such as this 
in California. 

Administration of Assessment 

Following an overview of the administration of the assessment, this section contains 
information on the following: logistics (e.g., identifying the teacher sample, scheduling 
observations), security, assessors and their training, scoring, and perceptions of the 
administration by teachers, assessors and FWL staff members. 

Overview 

As with any observation system, the administration of the Science Laboratory 
Assessment required careful planning and coordination on the part of the observers, the 
new teachers, and the school administrators. Observers and new teachers had to be 
recruited and scheduled, and observers also had to be trained. Moreover, because this 
observation system is content-specific, new teachers had to be carefully matched with 
assessors of the appropriate science background (e.g., life or physical science). Over 
approximately a six-week period which ended in June, 1990, eleven trained assessors 
observed a total of 29 new teachers conducting laboratory science lessons. As shown in 
Table 4.1, the observations were conducted in five areas across the state, and both Project 
and Non-Project teachers participated. Although the majority of the 29 participating 
teachers were white, in their first year of teaching, and teaching at the secondary school 
level (i.e., high school or middle school), the teacher sample was almost evenly divided 
between males and females. 

Logistics 

Administration of the Science Laboratory Assessment required the following 
logistical activities: recruiting and training observers, identifying the teacher sample, 
scheduling the observations, sending orientation materials to the teachers, and acquiring 
evaluation feedback from the teachers and assessors. 

Recruiting and training observers. The observers for the RMC pilot test were 
carefully recruited by FWL staff with assistance from a consultant to the Science Laboratory 
Assessment project. All were experienced science teachers, either currently teaching at the 
secondary level (i.e., middle school or senior high), working at the district level, or on 
sabbatical. Twelve observers were originally recruited, three of whom were members of the 
Science Laboratory Assessment Development Committee. Unfortunately, one of these three 
had to withdraw from the program after the training due to time constraints. RMC staff 
trained the observers in a two-day session, preceded by one day of home study. (For more 
information about the observers and their training, see the section, "Assessors and Their 
Training.") 

Identifying the teacher sample. Table 4.1 presents information about the teacher 
sample for this assessment. It was necessary to recruit Non-Project teachers in addition to 
CNTP teachers in order to have a sample that represented different regions of the state, all 
grade levels (i.e., high school, middle school and elementary), different ethnic groups, and a 
variety of teaching contexts (including physical or life science classes). As was the case last 
year, FWL staff recruited the majority of the Non-Project teachers by calling school districts 
which neighbored CNTP districts, and asking for names of first- and second-year science 
teachers. 






TABLE 4.1 

PILOT TEST PARTICIPANTS 
SCIENCE LABORATORY ASSESSMENT 
(Number of Teachers = 29) 

                               No. of Teachers 
                             Project    Non-Project 
Chico Area                      2            - 
Sacramento Area                 1            3 
San Francisco Area              5            3 
Fresno Area                     3 
Los Angeles Area                1            3 
Total Number of Teachers       12           17 

Teacher Characteristics 
  24 Caucasian, non-Hispanic; 2 Hispanic; 1 Asian or Pacific Islander; 
  1 Native American; 1 Other 
  16 Male; 13 Female 
  13 High School; 11 Middle School; 5 Elementary School 
  10 Second-Year; 2 Third Year 








Scheduling the observations. After identifying the participants, the observations 
were scheduled. Scheduling required a match between the teacher and observer in three 
areas: 

(1) availability (e.g., dates, times); 

(2) science background (e.g., life or physical science); and 

(3) teaching background (i.e., secondary or elementary teaching experience). 

A teacher teaching a high school chemistry lesson, for example, had to be paired with 
an observer who also had high school chemistry experience and who was available to 
observe on the date and at the time the teacher chose. Similarly, an elementary teacher 
conducting a life science lesson could only be paired with an available observer with a life 
science background and who also had some elementary school experience (i.e., teaching 
elementary school or elementary teachers, or developing elementary science curriculum). 
Moreover, in order to get a double-scoring sample, five teachers were observed by five 
different pairs of observers. For these observations the logistical difficulties were increased. 
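
The three matching requirements above can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions: the names `Teacher`, `Observer`, `is_match`, and `eligible_observers`, and all field names, are hypothetical and do not come from the pilot test materials. It shows one way a candidate pairing might be checked against availability, science background, and teaching background.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Teacher:
    # All fields are hypothetical illustrations, not pilot test data.
    name: str
    slot: str          # date and time the teacher chose for the observation
    science_area: str  # e.g., "life" or "physical"
    level: str         # "elementary" or "secondary"


@dataclass(frozen=True)
class Observer:
    name: str
    slots: frozenset          # dates and times the observer is available
    science_areas: frozenset  # science backgrounds the observer has
    levels: frozenset         # teaching levels the observer has experience with


def is_match(teacher, observer):
    """True only if all three scheduling criteria are satisfied."""
    return (
        teacher.slot in observer.slots
        and teacher.science_area in observer.science_areas
        and teacher.level in observer.levels
    )


def eligible_observers(teacher, observers):
    """List every observer who could be paired with this teacher."""
    return [o for o in observers if is_match(teacher, o)]
```

Because all three conditions must hold at once, the pool of eligible observers for any one teacher can be small, which is consistent with the scheduling difficulty described above, particularly when two observers are needed for a double-scored observation.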

Sending orientation materials. Shortly before the observations, the participating 
teachers were mailed an orientation handbook which included the domains, elements, and 
indicators of the assessment, the questions for the Pre- and Post-Observation Conferences, 
and the Pre-Observation Questionnaire to be filled out by the teacher before the 
observation. Also included were three samples of completed Questionnaires, each 
representing a different grade level (i.e., elementary, middle, and high school). 

Collecting evaluation feedback. After the observations, teachers were sent an 
evaluation form to fill out and return to FWL. Evaluation forms were also given to each of 
the observers who returned them to FWL along with their observation documentation. 

Security 

Because the content of the assessment was included as part of the orientation 
materials, the focus of security for this assessment was on the completed documentation for 
each teacher. Assessors mailed the documentation materials to FWL, where they were 
securely filed. 






As we noted in the first-year report (Assessment Component of the California New 
Teacher Project: First Year Report, March 1990), if an observation system is selected as a 
method of assessment for credentialing teachers in California, procedures to ensure security 
at the observation and processing stages (and during long-term storage) would have to be 
developed and implemented. Each piece of documentation (i.e., Guided Note-Taking Forms, 
Documentation Sorting Records, Summary Report Forms, Pre- and Post-Observation forms, 
and Pre-Observation Questionnaires) would have to contain identifying information in case 
the pieces became separated. For this type of assessment, such information would probably 
include the following: teacher code, observer code, date of observation, and perhaps type of 
science lesson. All documentation for a given teacher credential candidate would also have 
to be retained for a minimum number of years, enough to cover the period in which teachers 
could appeal decisions, or to meet statutory requirements. 
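
As a rough illustration of the identifying information discussed above, the following Python sketch labels each piece of documentation so that separated pieces can be regrouped. It is a sketch under stated assumptions only: `DocumentLabel`, `regroup`, and the field names are hypothetical, and no such record format was defined in the pilot test.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocumentLabel:
    # Hypothetical label carried by every piece of documentation.
    form_type: str          # e.g., "Guided Note-Taking Form"
    teacher_code: str
    observer_code: str
    observation_date: date
    lesson_type: str = ""   # optional, since the text lists it only as "perhaps"


def regroup(labels):
    """Group separated pieces by (teacher, observer, date) so a packet can be reassembled."""
    packets = defaultdict(list)
    for label in labels:
        key = (label.teacher_code, label.observer_code, label.observation_date)
        packets[key].append(label.form_type)
    return dict(packets)
```

Keying on the teacher code, observer code, and observation date together keeps double-scored observations distinct, since two observers documenting the same teacher on the same day would produce two separate packets.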

Assessors and Their Training 

Twelve assessors were recruited and trained to conduct and score a minimum of 
three observations each for this assessment. As mentioned earlier, one of the assessors 
withdrew from the program after the training due to time constraints. This section 
describes some characteristics of the assessors, describes the training, and presents the 
perceptions of the training by the assessors and FWL staff. 

Characteristics of the assessors. All of the assessors had several years' experience as 
teachers in California schools, were knowledgeable about at least one area of science, and 
had worked with student teachers, as a teacher trainer, or as a mentor teacher. Of the 
eleven observers who participated in the pilot test, there were five females and six males. 
Nine of the observers were currently teaching high school science; of the three other 
observers, one taught junior high science (and served as department chair), one was a 
district science resource specialist, and the third was on sabbatical working at a private 
chemical laboratory. All but two of the observers worked in Northern California; one 
worked in Fresno and the other in Southern California. 

Training. Training for conducting and scoring the observations lasted three days: 
one day of home study, and two days of group training. For the home study day, trainees 
were instructed to read the "Observer's Handbook," and to thoroughly familiarize 
themselves with the domains, elements and indicators, and with the assessment materials, 
forms, and procedures. 






The two-day group training was conducted by RMC staff in San Francisco on April 6 
and 7, 1990. The first day of training consisted of the following: a quick review of all the 
documentation forms (e.g., Pre- and Post-Observation Conference Forms, Guided Note- 
Taking Forms); training in how to conduct the Pre-Observation Conference, followed by role 
playing; and practice in using the Guided Note-Taking Forms by watching and taking notes 
from videotape segments of science lessons. 

The second day of training provided the trainees with more practice using the 
Guided Note-Taking forms (using videotape segments); practice in completing a 
Documentation Sorting Record; an introduction to the Summary Report Form (i.e., how to 
score performances), and simulated practice (again using videotapes) in conducting, 
documenting, and scoring an observation. 

At the end of the two days, the trainees' documentation from the final practice 
videotape was collected for review by the trainers. This documentation was used by the 
trainers and FWL staff to informally assess each trainee in three areas: (1) recording 
evidence and notes properly on the Guided Note-Taking Forms; (2) sorting the information 
by elements in a reasonable manner on the Documentation Sorting Record; and (3) making 
objective remarks and reasonable judgements on the Summary Report Form. Based on the 
problems trainees encountered, RMC and FWL staffs prepared a three-page list of "helpful 
hints" which was mailed to all observers before they did any observations for the pilot test. 

Perceptions of training. The observers were not asked by the trainers to evaluate 
their training at the end of the session. When collecting evaluation feedback from the 
observers, however, FWL staff included a page of questions about the training they had 
received. Of the eleven observers, one observer described the training as "very good," six 
observers described the training as "adequate," and three found the training to be 
"insufficient." Six of the observers also noted that they found the practice in taking notes 
from the videotapes to be the most useful part of the training. 

All of the observers had suggestions for improving the training. Suggestions for 
improvement given by two or more observers were as follows: 

• Increase training time (5 observers) 

• Increase opportunities for discussion about the assessment (4 observers) 

• Make sure the assessors understand the domains and elements at the 
beginning of the training (2 observers) 






Based on their own observations of the training, FWL staff concur with the above 
suggestions. Training would be improved by increasing the training time, especially the 
time allotted for "hands-on" practice (e.g., practice using the different forms, practice scoring 
the candidates) and discussion of the assessment's different components (e.g., 
documentation, scoring). In addition, any future training for this assessment should begin 
with a review of, or solid introduction to, the assessment's content so that all of the 
participants agree on the definitions of the elements. Since the content is the foundation of 
the assessment, this review should be done before giving the observers any other task (e.g., 
asking them to watch videotape segments and collect evidence pertaining to the elements). 

Training could also benefit from more explicit instruction and examples on how to 
record evidence and notes on each of the assessment forms, as well as on how to evaluate 
teacher performances. Regarding the latter, observers should also be given ample 
opportunity to practice and discuss the evaluation process in order to help ensure that there 
is consistency among observers. (More discussion of the evaluation process can be found in 
the next section, "Scoring.") 

In order to provide time for the improvements described above, the two-day group 
training should be extended by a minimum of one full day. As evidenced by the amount of 
questions, confusion, and actual frustration expressed on the first day of the group training, 
the home-study day seemed to contribute little to the training, aside from a basic 
introduction to the assessment's content, forms, and procedures. Although a home-study 
day was chosen by the trainers because it was very difficult to schedule three consecutive 
days for training which all the observers could attend, at least three consecutive in-session 
days of training appear necessary if the administration, content, and format of the 
assessment are to be covered adequately. 

Scoring 

The scoring system of the Science Laboratory Assessment is an integral part of the 
assessment prototype. The same person who conducts the observation uses the 
documentation from the observation to score the assessment. 

For the Science Laboratory Assessment, the scoring process is also directly linked to 
the documentation process. The observer first records data (i.e., evidence and notes) from 
the observation directly on the Guided Note-Taking Forms. These forms, as mentioned 
earlier, require the observer to categorize the data according to domain. Upon completion 
of the observation, the observer begins step two which requires further categorization of the 
data by "sorting" it on the Documentation Sorting Record (DSR). The observer sorts the 
data according to the elements which correspond to each domain. The observer also uses 
the DSR to categorize and sort by element the data collected on the teacher's questionnaire 
and on the two conference forms. 

Having sifted or sorted the data according to the elements corresponding to each 
domain, the observer is then ready to make judgments about the teacher's performance level 
in each domain. Taking one domain at a time, the observer first reviews all the information 
on the DSR which corresponds to that domain. If, looking at all the information listed 
across the elements of the domain, the observer "feels the teacher has shown a minimally 
acceptable level of performance," then the observer writes a "2" in the rating box for that 
domain on the Summary Report Form. If the observer feels the teacher's performance is 
not acceptable, the observer enters a "1." If the observer cannot make a judgement, either 
because of lack of information or a borderline level of performance between minimally 
acceptable and not acceptable, the observer enters an "X." After giving a rating, the 
observer writes three or four summary remarks about the teacher's noteworthy strengths 
and weaknesses related to the elements in that domain. The observer repeats this process 
for all seven domains. 

The last step of the scoring process requires the observer to make an overall 
judgment of the teacher's performance. The observer again rereads the information on the 
DSR, and then reviews his/her judgments and remarks made on the Summary Report Form. 
After reviewing all of this information, the observer makes a judgment as to whether the 
teacher's overall performance is acceptable (a "2" rating) or unacceptable (a "1" rating). Any 
comments the observer thinks should be considered regarding the overall rating are entered 
next to the rating. 
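
The rating scheme described above can be sketched as a small record structure. This is an illustrative sketch only, not part of the assessment materials: the class name `SummaryReport` and its methods are hypothetical, and the seven domain names are taken from Figure 4.3 of this report. The sketch records ratings but deliberately does not compute the overall rating, since the report describes that rating as a holistic judgment made by the observer rather than a formula applied to the domain ratings.

```python
from dataclasses import dataclass, field

# Domain names as they appear in Figure 4.3 of the report.
DOMAINS = (
    "Pedagogy",
    "Content",
    "Materials/Equipment",
    "Management",
    "Knowledge of Students",
    "Climate",
    "Communication",
)

# "2" = minimally acceptable, "1" = not acceptable,
# "X" = no judgment possible (lack of information or borderline performance).
DOMAIN_RATINGS = {"1", "2", "X"}


@dataclass
class SummaryReport:
    """Hypothetical stand-in for the Summary Report Form."""

    domain_ratings: dict = field(default_factory=dict)
    overall_rating: str = ""  # entered by the observer, never derived here

    def rate_domain(self, domain: str, rating: str) -> None:
        """Record the observer's rating for one domain after validating it."""
        if domain not in DOMAINS:
            raise ValueError(f"unknown domain: {domain}")
        if rating not in DOMAIN_RATINGS:
            raise ValueError(f"invalid rating: {rating}")
        self.domain_ratings[domain] = rating

    def unrated_domains(self) -> list:
        """Domains the observer has not yet rated."""
        return [d for d in DOMAINS if d not in self.domain_ratings]

    def domains_with(self, rating: str) -> list:
        """Domains that received the given rating, in form order."""
        return [d for d in DOMAINS if self.domain_ratings.get(d) == rating]


# Example: a teacher rated acceptable everywhere except Content.
report = SummaryReport()
for name in DOMAINS:
    report.rate_domain(name, "2")
report.rate_domain("Content", "1")
report.overall_rating = "2"  # the observer's own holistic judgment
```

Keeping the overall rating as a separately entered field mirrors the process described in the text, where the observer rereads the DSR and the domain remarks before deciding.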

The scoring process, like those of other high-inference observation systems, is very 
labor intensive. Not including the observation time, the entire process takes, on average, 
approximately three to four hours. 

Teacher, Assessor, and FWL Staff Perceptions of Administration

All but one of the teachers and one of the observers expressed satisfaction with the
arrangements (e.g., scheduling, room arrangements) made for the administration of this
assessment. The one dissenting teacher did not like being assessed at the end of the school
year. The one dissenting observer, who conducted four observations for the
pilot test, stated that she had "absolutely no complaints with the logistical arrangements,"
but that she did not like having to leave her classroom in order to administer the
assessment. As she explained,

I had difficulty squeezing out the time to make the carefully
documented lesson plans my substitutes needed so that I could
be away [to observe]. In addition... I had to expend
considerable effort in advance planning so that I could create
the kind of day a sub could handle.

Concern, similar to the above assessor's, about trying to juggle the administration of 
these assessments with the execution of their teaching duties was also expressed verbally to 
FWL staff by other observers. Although the assessors' burdens would be lessened if the 
observations were spaced further apart (all observations for this pilot test were conducted 
within a six-week period during a busy time of the school year), good teachers devoted to 
their students would probably still experience difficulty leaving their classrooms, especially 
if they were also taking time off for other professional obligations (e.g., serving as mentor 
teachers, serving on curriculum development committees). 

Should an observation system such as this be considered for credentialing use in the 
state, the time difficulties experienced by the observers in this pilot test should be kept in 
mind. The issue is especially noteworthy if one agrees with the assessors of this pilot test, 
all of whom stated that this assessment should only be administered by experienced science 
teachers. Although the assessors differed as to how much experience is necessary (answers
ranged from "moderate" to "a great deal"), all agreed that the assessment should not be
administered by someone who is knowledgeable about science but has no science teaching
background. As one observer noted, "Science trained non-educators haven't a clue about
classroom management and planning." 

Finally, with regard to administering the assessment, nine of the eleven assessors
expressed displeasure with the amount of time it took to complete the Documentation 
Sorting Record, and eight assessors said they had difficulty observing the lesson and 
collecting evidence. Since both of these issues are directly related to the format of the 
assessment, they will be discussed more completely in the section, "Assessment Format." 






Assessment Content



The developers of the Science Laboratory Assessment chose science, and in 
particular, science laboratory activity, as the content and focus of this assessment for several 
reasons. First, responding to the national need for highly trained scientists, they believe 
that in order to develop top scientists, we need top quality science teachers, starting at the 
elementary school level. Second, research conducted by two of the assessment developers 
reveals the importance of science instruction for developing students' basic skills in reading,
mathematics, and writing (Wheeler, 1986-87), as well as more advanced thinking skills in
these and other areas (Quellmalz, 1985). Third, educators who were asked by the California
Commission on Teacher Credentialing to review the two NTE Specialty Area Tests in
science (i.e., Biology and General Science; Chemistry, Physics, and General Science)
expressed great concern that, in the credentialing process, there was no assessment of 
laboratory demonstration and presentation skills or of lab safety. While acknowledging that 
some aspects of lab safety could be assessed by a multiple-choice exam, all of the reviewers 
felt that the only way to evaluate the laboratory presentation and demonstration skills of 
credential candidates was through a performance assessment which focused on laboratory 
science. 

In the following pages, the content of the Science Laboratory Assessment is 
evaluated along these dimensions:

• Congruence with the 1990 California Science Framework;
• Extent of coverage of California Standards for Beginning Teachers;
• Job-relatedness of the instrument;
• Appropriateness for beginning teachers;
• Appropriateness across different teaching contexts (e.g., grade levels,
  diverse student groups);
• Fairness across groups of teachers (e.g., ethnic groups, gender); and
• Appropriateness as a method of assessment.

We would like to note that, as was the case with all of the assessment instruments 
pilot tested this spring and summer, the Science Laboratory Assessment was developed for 
the State of California within a specific development timeline. Although the science 
educators who participated in the statewide review of the instrument were asked to
comment on the job necessity and appropriateness for new teachers of the domains and
elements covered by the assessment, there was not sufficient time to conduct a larger 
content validity study. Without such a study, our ability to comment on the assessment's 
appropriateness along such dimensions as job-relatedness, appropriateness for beginning 
teachers, and appropriateness across contexts is limited. Thus, excluding the first two 
dimensions of curriculum congruence and standards coverage (which are based on FWL 
staff's analysis of the documents involved), the discussions of the remaining dimensions are 
based on the perspective of the participating teachers and assessors, and FWL staff, as 
reflected in feedback forms, in informal conversations with the assessors, and in data from 
the assessment's rating forms. 

Congruence with the 1990 California Science Framework

FWL staff reviewed the Science Laboratory Assessment to see in what ways the 
assessment is congruent with the California Science Framework, and how it could be 
modified to achieve better congruence. For our analysis, we used the 1990 Science
Framework for California Public Schools, Kindergarten Through Grade Twelve. This
framework is divided into three
parts, each focusing on a different aspect of science instruction. The first part provides a
context for instruction by describing the nature of science and the major themes of science. 
The second part focuses on instructional content, providing examples of theories and themes 
in the life, physical, and earth sciences to be taught at different grade levels (K-12). The 
third and final part of the framework presents specific information on how to achieve a 
desired science curriculum. It includes descriptions of appropriate science pedagogy to be 
applied by the teacher in the classroom; ways in which a district or school can implement a 
strong science program, and the criteria used by the state in its adoption process of science 
instructional materials. 

Table 4.2 lists the three parts of the framework and their corresponding chapters,
and then describes the Science Laboratory Assessment components (e.g., domains and
elements, pre-observation conference questions) that are congruent with the framework. As
the table indicates, there are some assessment components that are congruent with each
part of the framework, but only in a partial manner. Strongest congruency is with the
framework's description of science processes and the teaching of science (Chapter 6). This
part of the framework is addressed by the nature of the assessment (i.e., an observation of a
hands-on science laboratory activity), several domains and elements, and some questions on
the pre-observation conference form. Even this congruency is partial, however. One part of
the chapter, for example, presents those science processes (e.g., observing, communicating,
comparing) which can best be expected from students at different grade levels. Although
the Science Laboratory Assessment includes references to science processes in at least two
domains, it does not make the grade-level distinctions presented by the framework.

TABLE 4.2

COVERAGE OF THE CALIFORNIA SCIENCE FRAMEWORK
BY THE SCIENCE LABORATORY ASSESSMENT

Each entry gives the framework content, followed by the relevant assessment components.

PART I: WHAT IS SCIENCE?

Chapter 1: Nature of Science
  -Addressed by the "Attitudes" element of the Climate domain.

Chapter 2: Major Themes of Science
  -The "Integrated" element of the Content domain requires that the teacher knows
   the theme of the activity. #14 on the Teacher Questionnaire asks the teacher
   to specify the theme of the activity.

PART II: CONTENT OF SCIENCE

Chapter 3: Physical Science
Chapter 4: Earth Sciences
Chapter 5: Life Sciences
  -The content of the observed activity should fit into one of these three areas.

PART III: ACHIEVING THE DESIRED CURRICULUM

Chapter 6: Science Processes and the Teaching of Science
  -Chapter 6 content addressed by several domains and elements: Pedagogy
   ("Planning," "Questioning"); Knowledge of Students ("Diversity," "Student
   Characteristics"); and Climate ("Interaction with Students," "Inquiry").
  -Also addressed by Pre-Observation Conference Questions #5, 6, and 10.
  -Observation must be of hands-on activity.

Chapter 7: Implementing a Strong Science Program (at school district and site level)
  -Not addressed by this assessment.

Chapter 8: Instructional Materials Criteria (as applied to adoption of materials)
  -Not addressed by this assessment.

Perhaps one way in which the assessment could be modified to achieve better 
congruence with the framework would be to weave the idea of science themes throughout 
more of the assessment. As stated in the framework, "the 1990 Science Framework differs 
from previous frameworks. ..in its emphasis on the major themes of science." Although the 
Science Laboratory Assessment asks the teacher to specify on the teacher questionnaire the 
theme(s) of the laboratory activity to be observed, the observer is not asked to find evidence 
that the teacher has presented the theme(s) to the students, either orally, in written 
materials, or in any part of the lesson. The idea of themes could easily be written into 
several elements of the Pedagogy domain (e.g., Planning, Prelab, Explanation/Presentation), 
and the wording in the description of the Integrated element of the Content domain could 
be changed so that the teacher does not just know how the activity is related to a major
theme of science, but also presents this information to the students. 

Two other possible changes to be considered would be to include somewhere in the 
assessment (e.g., in an element) some reference that (1) the teacher discusses or presents, 
whenever possible and appropriate, any values and ethics associated with the science 
activity, and (2) the teacher uses written instructional materials that meet the standards in 
the framework. 

Extent of Coverage of California Standards for Beginning Teachers 

Because the Science Laboratory Assessment was developed for the State of 
California, the developers designed the assessment to cover at least some of the California 
Standards for Beginning Teachers. FWL staff examined the four components of the
assessment (the questionnaire, the domains and their corresponding elements, and the pre-
and post-conference questions) to see how well they assess the California Beginning Teacher
Standards which define levels of pedagogical competence and performance that California 
teacher credential candidates are expected to attain (i.e., Standards 22 to 32). As was done 
in the March, 1990 report, the standards are reprinted below (in italics), along with an 
analysis of how the assessment components correspond to each standard. 






Standard 22: Student Rapport and Classroom Environment. Each candidate
establishes and sustains a level of student rapport and a classroom environment that
promotes learning and equity, and that fosters mutual respect among the persons in a class.
Both the Climate and Management domains address this standard. Elements in the Climate
domain require the observer to look for evidence that the teacher communicates and
interacts respectfully with all students, communicates high expectations for student learning
and behavior, and that students treat each other respectfully and politely. An element in
the Management domain asks the observer to find evidence that the teacher encourages and
reinforces appropriate student behavior.

Standard 23: Curricular and Instructional Planning Skills. Each candidate prepares 
at least one unit plan and several lesson plans that include goals, objectives, strategies, 
activities, materials and assessment plans that are well defined and coordinated with each 
other. The assessment requires the teacher to plan a 30-45 minute (depending on grade 
level) science laboratory lesson for observation and to specify on a pre-observation 
questionnaire the objectives, laboratory activities, student groups, materials and equipment, 
and safety issues. The questionnaire does not, however, ask the teacher to specify any 
assessment plans. 

Two domains also address this standard. The Planning element in the Pedagogy
domain defines, at a general level, what the teacher's objective(s) should look like (e.g.,
involve the development or utilization of one or more of the scientific thinking processes),
and the Sequencing element in the same domain asks the observer to look for evidence that
the teacher organizes the laboratory activity in a logical or purposeful manner that allows
students to achieve the lesson objective(s). The Content domain has an element which asks
the observer to find evidence that the teacher uses methods that are related to the
objective(s) of the laboratory activity.

Standard 24: Diverse and Appropriate Teaching. Each candidate prepares and uses 
instructional strategies, activities, and materials that are appropriate for students with 
diverse needs, interests and learning styles. Three domains and two conference questions 
address this standard. The Knowledge of Students domain asks the observer to look for 
evidence that the teacher uses instructional strategies and/or activities that are appropriate 
and challenging for diverse students (e.g., different ethnic, cultural, language, and 
socioeconomic backgrounds, and disabled students) and students with different student 
characteristics (e.g., different interests, cognitive and developmental levels, prior 
knowledge). 




In the Pedagogy domain, the observer is expected to look for evidence that the
teacher knows the students' prior learnings, gives directions, explanations, and
presentations at an appropriate level of complexity and difficulty for the students, and
provides objective feedback to and asks questions of students regardless of ability, ethnicity,
or other characteristics. The Availability element in the Materials/Equipment domain asks
for evidence that the teacher has made provisions for materials to be available to physically
disabled students.

In the Pre-Observation Conference, the teacher is asked if s/he designed or modified 
the activity to make it appropriate for the students' background and interests, and also to
describe how the lab activity is related to prior instruction (e.g., which might be construed 
as prior learnings). 

Standard 25: Student Motivation, Involvement, and Conduct. Each candidate
motivates and sustains student interest, involvement and appropriate conduct equitably
during a variety of class activities. Several domains address this standard: The Pedagogy
domain asks the observer to find evidence that the teacher provides motivating feedback to
all students, and that the teacher's questioning involves as many students as possible; the
Materials/Equipment domain looks for evidence that, regarding materials/equipment, the
teacher has provided easy access and enough so that all students can complete the activity;
elements in the Management domain ask for evidence that the teacher has structured the
laboratory activity so that most of the students are engaged in a laboratory task most of the
time, and that the teacher encourages and reinforces appropriate student behavior; and the
Climate domain seeks evidence that the teacher provides all students with an opportunity to
participate and learn.

Standard 26: Presentation Skills. Each candidate communicates effectively by 
presenting ideas and instructions clearly and meaningfully to students. The Communication 
domain asks the observer to find evidence that the teacher's oral and written 
communications are clear and "not vague, ambiguous, or incomplete." Although there is no 
specific mention of the teacher adjusting the complexity of his/her language to the linguistic 
abilities of the students, the Pedagogy domain asks for evidence that the teacher gives 
directions, explanations and presentations that are at an appropriate level of complexity and 
difficulty for the students. 






Standard 27: Student Diagnosis, Achievement and Evaluation. Each candidate 
identifies students' prior attainments, achieves significant instructional objectives, and 
evaluates the achievements of the students in a class. Two domains address this standard. 
The Pedagogy domain has elements which ask the observer to find evidence that the teacher 
knows what prerequisite skills and knowledge the students have for an activity (but it does 
not ascertain how s/he knows), monitors student understanding and work during the 
activity, and adjusts the lesson or activity as needed. The Climate domain asks for evidence 
that "the teacher communicates high expectations for student learning." 

Some of the pre-observation conference questions also ask the teacher what s/he
knows about the students' prior knowledge, but again do not ask how s/he assessed this
knowledge. In the post-observation conference, the teacher is asked if the objectives were 
attained by the students, and what is the teacher's plan to assess the retention of these 
objectives. 

Standard 28: Cognitive Outcomes of Teaching. Each candidate improves the ability 
of students in a class to evaluate information, think analytically, and reach sound 
conclusions. The Pedagogy and the Climate domains address this standard. For the 
Pedagogy domain, one element asks the observer to look for evidence that the objectives for 
the activity involve the development or utilization of one or more of the scientific thinking 
processes (i.e., observing, communicating, comparing, ordering, categorizing, relating,
inferring, and applying). Another element asks for evidence that the teacher asks questions 
that promote higher-order thinking processes (such as those listed above). The Climate 
domain asks for evidence that the teacher fosters an environment that promotes 
questioning, problem solving, discussion of error, and evaluation of competing ideas. 

A pre-observation conference question also addresses this standard by asking the 
teacher, "What advanced thinking skills (e.g., comparing, estimating, inferring) will students 
be encouraged to use or required to apply in order to productively participate in this 
activity?" 

Standard 29: Affective Outcomes of Teaching. Each candidate fosters positive 
student attitudes toward the subjects learned, the students themselves, and their capacity to 
become independent learners. The Climate domain's four elements address this standard by
asking the observer to find evidence that the teacher interacts with all students positively,
encourages sharing among students, attempts to instill in students positive attitudes about
learning and science, and fosters an environment in which the processes of science are
important. One of the Pedagogy domain's elements also addresses this standard by asking
for evidence that the teacher gives feedback to students that "provides positive rewards,
useful information, further motivation, or encouragement to students."

Although none of the domains specifically addresses the promotion of students as 
independent learners, the focus of the assessment is a science laboratory activity which, by 
its nature, usually involves some form of independent learning by students.

Standard 30: Capacity to Teach Cross-Culturally. Each candidate demonstrates 
compatibility with, and ability to teach, students who are different from the candidate. The 
differences between students and the candidate should include ethnic, cultural, gender, 
linguistic and socio-economic differences. The Knowledge of Students domain asks the 
observer to seek evidence that the teacher tailors instructional activities for a diverse 
classroom of students with different ethnic, cultural, language, and socioeconomic 
backgrounds. As is probably the case with any observation system, however, a teacher's 
capacity to teach cross-culturally can probably only be demonstrated if the teacher is 
teaching in an ethnically diverse classroom. 

Standard 31: Readiness for Diverse Responsibilities. Each candidate teaches
students of diverse ages and abilities, and assumes the responsibilities of full-time teachers. 
This standard focuses on a teacher's ability to teach classes which span the range covered 
by the credential (i.e., grades K-8 or 7-12) or students at two or more ability levels (such as 
remedial and college preparatory classes). None of the domains are designed to assess this 
ability. This standard also addresses a teacher's ability to fulfill typical responsibilities of 
teachers such as meeting school deadlines and keeping student records, none of which are 
assessed by any of the domains. 

Standard 32: Professional Obligations. Each candidate adheres to high standards of 
professional conduct, cooperates effectively with other adults in the school community, and 
develops professionally through self-assessment and collegial interactions with other 
members of the profession. None of the domains assess whether a teacher fulfills his/her 
obligations as a member of a profession and a school community (e.g., adheres to high 
standards of professional conduct and engages in collegial relationships). 






The extent of coverage by the Science Laboratory Assessment of the California 
Beginning Teacher Standards is summarized in Table 4.3. The table lists the assessment's 
domains and questions which address each standard, and also describes the extent of 
coverage provided. 
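
As a quick check on how much of Standards 22 to 32 the assessment reaches, the coverage ratings in Table 4.3 can be tallied. The snippet below is only a sketch: the dictionary is a hand transcription of the "Extent of Coverage" column of Table 4.3, and the variable names are hypothetical.

```python
from collections import Counter

# Extent of coverage per standard, transcribed from Table 4.3.
COVERAGE = {
    22: "Full",
    23: "Partial",
    24: "Full",
    25: "Full",
    26: "Full",
    27: "Partial",
    28: "Full",
    29: "Full",
    30: "Partial",
    31: "None",
    32: "None",
}

# Count how many standards fall into each coverage category.
tally = Counter(COVERAGE.values())

# Standards that the assessment does not address at all.
uncovered = sorted(s for s, extent in COVERAGE.items() if extent == "None")

for extent in ("Full", "Partial", "None"):
    print(f"{extent}: {tally[extent]} of {len(COVERAGE)} standards")
```

On this transcription, six of the eleven standards are rated as fully covered, three as partially covered, and two (Standards 31 and 32) as not covered.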

Job-relatedness

All 29 of the teachers who participated in the RMC assessment pilot test stated that
the seven domains chosen for this assessment are relevant to their job of teaching. The
pilot test's eleven observers also evaluated the content of the assessment as being relevant
to the job of a new teacher of science laboratory lessons, although one observer qualified his
answer: "It is only relevant if the teacher gets to see the report." Two observers praised the
instrument's relevance as follows:

I think it is vitally important to have a method to assess science
knowledge/attitude inquiry questioning techniques/lab safety for
new teachers.

The assessment is relevant to teaching lab science for any 
teacher, beginning or experienced. 

As was discussed in the March, 1990 report, the job-relatedness of observation
systems is strong because such systems almost always entail observing teachers actually 
teaching in their own classrooms (or to their own students). Moreover, job relevance is a 
particularly important factor in evaluating different approaches to teacher competence 
assessment, because professional practitioners and courts of law consider this factor first 
when they judge the fairness of an evaluation system. As an observation system, the 
Laboratory Science Assessment offers direct evidence of actual teaching competence. With 
such an assessment, it is not necessary to make inferences about how well a teacher 
conducts instruction. 

Appropriateness for Beginning Teachers 

In this section, the appropriateness of the Science Laboratory Assessment for 
beginning teachers is discussed from two perspectives: 1) the perceptions of the 
participating teachers and assessors, and 2) the teachers' performance on the assessment.






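TABLE 4.3

EXTENT OF COVERAGE BY THE SCIENCE LABORATORY ASSESSMENT OF
CALIFORNIA STANDARDS FOR BEGINNING TEACHERS

Each entry gives the standard, the domains and conference questions that address it,
and the extent of coverage.

22: Student Rapport and Classroom Environment
  Domains and Conference Questions: Management; Climate
  Extent of Coverage: Full

23: Curricular and Instructional Planning Skills
  Domains and Conference Questions: Pedagogy; Content
  Extent of Coverage: Partial

24: Diverse and Appropriate Teaching
  Domains and Conference Questions: Knowledge of Students; Pedagogy;
    Materials/Equipment; (Pre-Obs. Conf. #3, #6)
  Extent of Coverage: Full

25: Student Motivation, Involvement and Conduct
  Domains and Conference Questions: Pedagogy; Climate; Materials/Equipment
  Extent of Coverage: Full

26: Presentation Skills
  Domains and Conference Questions: Communication; Pedagogy
  Extent of Coverage: Full

27: Student Diagnosis, Achievement and Evaluation
  Domains and Conference Questions: Pedagogy; Climate;
    (Pre-Obs. Conf. #[illegible]); (Post-Obs. Conf. #2, #3)
  Extent of Coverage: Partial

28: Cognitive Outcomes of Teaching
  Domains and Conference Questions: Pedagogy; Climate; (Pre-Obs. Conf. #10)
  Extent of Coverage: Full

29: Affective Outcomes of Teaching
  Domains and Conference Questions: Pedagogy; Climate
  Extent of Coverage: Full

30: Capacity to Teach Crossculturally
  Domains and Conference Questions: Knowledge of Students
  Extent of Coverage: Partial

31: Readiness for Diverse Responsibilities
  Domains and Conference Questions: None
  Extent of Coverage: None

32: Professional Obligations
  Domains and Conference Questions: None
  Extent of Coverage: None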



Perceptions. When asked if they felt they have had an opportunity to acquire the
knowledge and abilities measured by the Science Laboratory Assessment, approximately
three-fourths (21 of 29) of the teachers responded positively; seven said "no," and one did
not respond. Of the seven teachers who replied negatively, four specifically commented that
one year is not enough time to achieve mastery of skills and knowledge. This sentiment
was also echoed by two of the teachers with positive responses who stated that they were in
their second year of teaching and their answers might be different if they were in their first
year.

The eleven assessors were also asked if they thought new teachers have had an 
opportunity to acquire the knowledge and abilities measured by the assessment. Five 
assessors responded positively, albeit two with qualifications (e.g., "if criteria [are] not too 
objectively applied"). One of the five stated that a teacher's academic and professional
preparation "covers all areas." Another commented, "no problem," because the assessment 
focuses on minimal proficiency. 

Of the remaining six assessors, one did not respond, two gave answers which were 
ambiguous, and three expressed a belief that the instrument may be too difficult for a 
"brand new" teacher (i.e., less than one year's experience). 

Performance on assessment. Overall, FWL staff's analysis of the rating results
supports the majority contention that the new teachers have had an opportunity to acquire
the skills and knowledge measured by the assessment. Of the 29 teachers observed, all but
two received an overall judgment of passing (i.e., a "2" rating). One teacher received an "X"
rating, indicating insufficient information to warrant a judgment, and the other teacher was
not given an overall rating. Furthermore, no teacher failed (i.e., received a "1" rating) more
than two domains (see Figure 4.3), and at least 18 teachers passed all domains.

Of the seven teachers who failed a domain, however, five were in their first year of 
teaching. Since 17 of the 29 teachers were first-year teachers, almost one third of the first- 
year teachers had difficulty in at least one area of the assessment. For three of those five 
teachers, that area was the Content domain. 
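
The proportions behind this comparison can be worked out directly from the counts given in the text. The snippet below is a minimal sketch using only those counts; the variable names are hypothetical, and the "not first-year" group is simply the remainder of the 29 teachers. With groups this small, the difference should be read as suggestive rather than conclusive.

```python
from fractions import Fraction

# Counts reported in the text.
TEACHERS_OBSERVED = 29
FIRST_YEAR_TEACHERS = 17
FAILED_A_DOMAIN = 7
FIRST_YEAR_FAILED = 5

# Share of first-year teachers who failed at least one domain.
first_year_fail_rate = Fraction(FIRST_YEAR_FAILED, FIRST_YEAR_TEACHERS)

# Share of the remaining teachers who failed at least one domain.
other_teachers = TEACHERS_OBSERVED - FIRST_YEAR_TEACHERS
other_failed = FAILED_A_DOMAIN - FIRST_YEAR_FAILED
other_fail_rate = Fraction(other_failed, other_teachers)

print(f"First-year teachers: {float(first_year_fail_rate):.0%}")
print(f"Other teachers:      {float(other_fail_rate):.0%}")
```

Five of 17 is about 29 percent, which is the "almost one third" cited above; the corresponding figure for the other twelve teachers is two of twelve, or about 17 percent.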

In fact, of the seven teachers who failed a domain, four failed the Content domain.
The reasons given for their "failure" tended to fall into two categories: (1) insufficient or
missing content, and (2) inaccurate content. For example, a middle school life science
teacher whose laboratory activity was a frog dissection was given a "1" rating
because the content she presented was "not extensive," but rather was "mostly label the
diagram." In addition, the observer faulted her for "little or no discussion of the function of
organs in humans vs. frogs." A high school chemistry teacher who conducted a lab titled "A
Reaction with Copper" was given a "1" rating because she did not know the symbol for
copper. Her assessor also remarked, "Should not mass anything while hot as it will give
results too small." A different error of commission was made by a high school biology
teacher who designed his own lab for an endocrine unit. He was failed because "the
relationship the teacher was suggesting (iodine-thyroxine-respiration rate) is not valid." He
was also cited for insufficient or missing content: "The relationship between the
temperature and respiration rate is valid, but was not actually addressed by the teacher to
the students."

FIGURE 4.3
Percent of Teachers Receiving a "2" Rating on Each Domain

[Bar chart. Domains shown: Pedagogy, Content, Materials/Equipment, Management,
Knowledge of Students, Climate, Communication.]

As evidenced above, the reasons given for failure, while falling into two categories, 
were vastly different. One teacher is failed for not knowing the symbol for copper; another 
is failed for misstating the relationship between an element, an amino acid, and a biological 
function. Moreover, it is interesting to note that although the Content domain was written
so that a teacher could be failed for inaccurate content, absolutely nothing was written to
suggest that a teacher could be failed for insufficient or missing content. (Revisions were
made, however, after the pilot test, to incorporate the concept of sufficiency into the scoring
process.)

One last note about content. The fourth teacher who received a "1" rating for the
Content domain was one of five teachers who were double scored. One of the two assessors
who observed this teacher passed her on the domain, and the other failed her. The assessor
who failed her described her content as follows:

Not accurate because printed materials erroneously confused
mass and force leading teacher to the same error. Teacher let
students have weights touching bottom of container while
reading mass (weight?) on spring scale, and did not recognize
this was not an accurate reading.

The assessor who passed her did not note any of the above problems and described her 
content as accurate. 






Another problem area for teachers, and also for assessors, was the Knowledge of
Students domain. Of the five teachers who received an "X" rating (i.e., a borderline
performance or insufficient information to make a rating), three received the rating in the
Knowledge of Students domain. These three "X" ratings suggest two possibilities. One, it
may be difficult for new teachers with little classroom experience to exhibit the kind of
behavior required by the Knowledge of Students domain. One of the domain's elements, for
example, specifies that "the teacher tailors instructional activities for a diverse classroom of
students with different ethnic, cultural, language, and socioeconomic backgrounds and, when
present, disabled students...and each student is challenged at an appropriate level." It is
possible that this is a lot to ask of a beginning teacher. Or, as one assessor stated,

Some of the domains/elements seem to me to be very advanced 
teaching skills that most beginning teachers will not yet have 
acquired (i.e., ability to adjust an individual activity in a variety 
of ways to meet different student needs). 

A second possibility is that these ratings indicate that it is difficult for an observer to 
assess this domain unless the observer has a strong knowledge of the students in the 
classroom. This possibility, however, will be discussed in the next section, "Appropriateness 
across Contexts." 

Although, as indicated earlier, the majority of teachers felt they had an opportunity 
to acquire the skills and knowledge measured by this assessment, many of the new teachers 
agreed with the above assessor who thought some domains/elements are harder than others. 
When asked to name the domains/elements which they thought a new teacher could pass 
only after two years of experience in the classroom, 15 and 13 teachers respectively named 
the Management and Knowledge of Students domains (see Figure 4.4). It is also interesting 
to note, however, that when asked which of the domains/elements could be passed 
immediately after student teaching, more than half of the teachers (19 of 29) named the 
Content domain, but all other domains received less than half of the teachers' votes (see 
Figure 4.5). Looking again at the rating results, the data seem to suggest that 1) the 
teachers' perceptions of their mastery of science content may be inflated, and 2) their 
perceptions of the difficulty of passing the Knowledge of Students domain may be more on 
target. Based on the teachers' perceptions of domain/element difficulty, the high rate of 
passing may also be attributable to the fact that the observations were conducted in the 
spring when all of the teachers had at least close to one year's experience in the classroom. 



Figure 4.4 
Domains Teachers Believe Could Only Be 
Passed with Two Years of Classroom Experience 

[Bar chart. Vertical axis: number of teachers. Horizontal axis: Pedagogy, Content, 
Materials/Equipment, Management, Knowledge of Students, Climate, Communication, All, None.] 

Figure 4.5 
Domains Teachers Believe Could Be Passed 
Immediately After Student Teaching 

[Bar chart. Vertical axis: number of teachers. Horizontal axis: Pedagogy, Content, 
Materials/Equipment, Management, Knowledge of Students, Climate, Communication, All, None.] 



Appropriateness across Contexts 



In order to determine if the teachers and assessors felt the Science Laboratory 
Assessment is appropriate for teachers across contexts, we specifically asked them to 
comment on the assessment's appropriateness across grade levels, for teachers of diverse 
student groups, and/or in different school/community settings. 

Across grade levels. Approximately 83% of the teachers (24 of 29) felt the 
assessment is appropriate for teachers across grade levels. Four teachers disagreed and one 
did not respond to the question. Of the four who disagreed, all were middle or high school 
teachers who thought the assessment was less appropriate for elementary teachers. Their 
reasons for disagreement, however, were not compelling. One middle school teacher, for 
example, commented that "elementary teachers and students would feel very uncomfortable 
with someone looking over their shoulder." Another middle school teacher stated, "It is 
more important for a primary school teacher to have good knowledge of students and a 
positive climate, than to worry about the planning and sequence." 

The scoring results do not suggest that elementary teachers are penalized by this 
assessment. As depicted in Figure 4.6, the elementary teachers performed as well as or better 
than the middle and high school teachers on each of the assessment's domains. For every 
domain but one, all five elementary teachers received a "2" rating. For the Content domain, 
one teacher received an "X" rating. 

The assessors' comments regarding the elementary teachers' performances also 
support the idea that the assessment is as appropriate for elementary teachers as for middle 
and high school teachers. For example, one assessor observed a first-year male elementary 
science teacher teaching a second-grade class a lab activity involving a comparison of seeds. 
The following is a comment made by the assessor to explain why she gave a "2" rating to the 
teacher's performance in the Pedagogy domain: 

The lesson involved exploration and imagination. Students 
were to use scientific thinking to come up with their own ideas 
about how seeds might be designed for dispersal. The 
directions were weak, but the teacher did excellent job of 
eliciting ideas from students through questioning. 



FIGURE 4.6 

PERCENT OF TEACHERS BY GRADE LEVEL RECEIVING 
A 'TWO' RATING ON EACH DOMAIN 

[Bar chart with panels for elementary, middle school (1 missing), and high school 
teachers. Vertical axis: percent of teachers, 0% to 100%.] 



Another elementary teacher, a kindergarten teacher with a Multiple Subjects 
Credential, was observed teaching an A.I.M.S. science activity called "Huff and Puff," which 
is part of a larger unit on aeronautics. According to the assessor who observed the lesson, 
the activity demonstrates that "air is energy and can be used to do work." Working with 
approximately eight students at a time (an aide worked with the remainder of students on 
something else), the teacher involved the students in blowing a variety of objects to see how 
many "blows" it took to move each object three feet. The activity involved problem-solving 
(i.e., students had to guess how many "blows" each object would require), collecting data, 
and recording data on a graph. For double-rating purposes, two assessors observed this 
teacher instructing the activity. Each assessor gave the teacher an overall rating of "2," and 
each had high praise for the teacher: 

I observed, to my astonishment, a kindergarten teacher doing 
exactly what I try to do at the senior high level, using the same 
skills, the same inquiry methods, and doing it very, very well. 

Excellent young teacher. Not really her first year. The best 
inquiry/critical thinking skills type questions I've heard in a 
long time! 

It should be noted that this teacher was teaching her first year in California, but had 
taught for two years in another state. It should also be noted that all five of the elementary 
teachers who participated in the pilot test were either hired as elementary science teachers 
or had received substantial science training through their district. Thus, when FWL staff 
agree that this assessment seems appropriate for teachers of all grade levels, we mean to 
say that it seems to be a fair assessment for those teachers, regardless of grade level, who 
have been trained to teach science. 

Diverse students. The developers of the Science Laboratory Assessment are well 
aware of the increasing diversity in California's classrooms. As a result, they included in 
the content of their assessment a domain specifically targeted to assessing a new teacher's 
ability to work with diverse students. This domain, Knowledge of Students, was designed to 
assess a teacher's ability, within a laboratory setting, to teach (1) students with different 
ethnic, cultural, language, and socioeconomic backgrounds, and disabled students, and (2) 
students with different interests, cognitive and developmental levels, and prior knowledge. 






Awareness of student diversity is not limited to one domain, however. Observers are 
also asked to consider a teacher's response to student diversity in other domains. For 
example, the elements, Feedback and Questioning, of the Pedagogy domain include specific 
references to student diversity; the Grouping element of the Management domain asks the 
observer to find evidence that the teacher has considered the "variable work rate of different 
students"; and the Availability element of the Materials/Equipment domain requires that the 
teacher make materials and equipment accessible to physically disabled students, when 
present. 

On the surface, then, it would seem that the Science Laboratory Assessment is able 
to assess a teacher's ability to work with diverse students. But can it? Although 
approximately 90% (26 of 29) of the teachers felt the assessment is appropriate for teachers 
of diverse student groups, the assessors were not so quick to agree. Almost half of the 
assessors named the Knowledge of Students domain as the hardest domain to rate, and 
more than half of the assessors (6 of 11) had serious reservations about using the 
assessment to assess a new teacher's ability to work with diverse student groups. Said one 
assessor: 

I feel the assessment was weakest in this area. Teachers did 
not seem to have confidence and complex enough skills to really 
discuss this area and it wasn't always possible to observe 
needed skills in one observation. 

This assessor also posed the question: 

Because these teachers are just beginners, is it realistic to 
expect them to be able to adjust their activities/techniques to 
meet individual needs and be able to discuss how/why they do 
what they do? 

Another assessor echoed this sentiment, stating that "the skill/ability to work with 
heterogeneous groups is the most difficult to learn....it takes time to get good at working 
with diverse groups." 






The majority of assessors, however, did not see the problem as one residing with the 
teachers, but rather with the assessors themselves. These assessors commented on the 
difficulty of always being able to recognize the different kinds of diversity among students. 
For example, 

Knowledge of Students was difficult to assess by mere 
observation without questioning the teacher. It is difficult to 
"see" which students are slower or faster so you can judge if the 
teacher deals with them differently. 

and, 

There are places where evidence or notes can be made, yet I 
found this type of evidence hard to gather, partly because I did 
not know what students were GATE, LEP, etc. unless I asked the 
teacher to point them out. 

Similarly, an assessor who pointed out that "behavior problems are easier to identify 
than LEP or 'science shy' types" commented, 

It's hard to know when the teacher has tailored lessons for 
students, and then to actually observe that, if you don't know 
who the students are. 

In other words, unless an assessor observes a teacher making a major blunder, an 
assessor's ability to assess a teacher's ability to work with diverse students depends largely 
on the assessor's knowledge of the students. Without such knowledge, it is, as one assessor 
stated, "hard to make an educated and informed opinion." 

Thus, while the Science Laboratory Assessment was designed to take into account 
the diversity of California's classrooms, it is questionable whether it is an effective way of 
assessing a beginning teacher's ability to work with diverse students. Although further 
study would need to be done to answer that question, FWL staff believes the Science 
Laboratory Assessment can be commended for putting a focus on a teacher's ability to work 
with diverse students, and recognizes that such a focus has the potential of improving 
teachers' skills in this area. 






Fairness across Groups of Teachers 



A majority of the teachers and the assessors responded positively to the question of 
fairness of the assessment across groups of teachers (e.g., different ethnic groups, different 
language groups). Only one teacher gave a negative answer, and this was a Caucasian 
female who stated that she did not feel qualified to speak for other ethnic/language groups. 
Of the ten assessors who responded to the question, nine seemed to agree with the assessor 
who stated: 

The domains of learning are the same regardless of the 
characteristics of the teacher. 

One assessor, however, was not sure of the assessment's fairness because, as she explained, 

Issues of management, climate, and communication will 
certainly be open to question. It is always possible that an 
observer may misinterpret or miss pertinent evidence. 

Our very limited sample showed no trends where certain groups of teachers did less 
well than other groups. For example, of the five teachers who are non-Caucasian, 
non-Hispanic, only one did not receive a "2" rating for all seven domains (one teacher received a 
"2" rating for six domains). Of the seven teachers who received one or more "1" ratings, we 
observed no clear patterns regarding age, gender, etc. 

Appropriateness as a Method of Assessment 

In addition to evaluating the appropriateness of the Science Laboratory Assessment 
for beginning teachers, and its appropriateness across contexts and groups of teachers, the 
teachers and assessors were asked to evaluate the appropriateness of the method of 
assessment, and to compare it with other methods of assessment which they have 
experienced. 

Appropriateness. The teachers were asked if they thought this type of assessment 
(i.e., classroom observation of a science type of laboratory activity) is an appropriate way of 
assessing 1) general teaching skills, and 2) skills in teaching laboratory science. Their 
answers to both were positive, with 76% (22 of 29) and 79% (23 of 29) of the teachers 
replying "yes" respectively. The assessors were even stronger in their affirmation, with 90% 
(10 of 11) and 82% (9 of 11) saying "yes" respectively to the assessment's appropriateness in 
assessing general teaching skills and skills in laboratory science. One assessor remarked, 

This is an organized way to take a look at science teachers and 
see what they actually do in a way that helps delineate 
excellent practices as well as practices that are missing, yet 
needed. 

Another assessor, however, disagreed that the assessment is an appropriate way to assess 
the science teaching skills of beginning teachers. As she explained, 

I think a lot of first year teachers shy away from a lot of lab, 
hands-on activities because they are 1) unfamiliar with what is 
available in the school/community, 2) labs take time to set up, and 
3) management is important in a lab and not well developed 
your first year of teaching. 

Comparison. All of the teachers were asked to compare the Science Laboratory 
Assessment with other assessments with which they have been evaluated (e.g., 
multiple-choice exams such as CBEST and NTE Specialty Area Tests, classroom observations during 
student teaching) in terms of its ability to assess teaching competency. Approximately 70% 
of the teachers (21 of 29) commented favorably about the assessment, many stating that the 
assessment is better than the NTE and/or CBEST tests. Almost 20% of the teachers (5 of 
29) said the assessment compared favorably with the classroom observations they had 
during student teaching. Some teachers commented as follows: 

Much superior to test of subject areas knowledge. Compares 
favorably with student teaching observations. 

While there were no distinctly negative comparisons, some teachers did describe 
weaknesses of the assessment. For example, four teachers said they did not like the 
assessment or could not compare it with others because they did not receive any feedback; 
two teachers stated that this type of assessment makes a first-year teacher very nervous 
because "you are put on the spot to perform"; and one teacher, who commended the 
assessment for "evaluating a lesson (which CBEST and NTE do not)," expressed concern 
that, in this assessment, the observer "becomes a stenographer and spends an excessive time 
writing down as much as possible to find each domain and element" so that there is the 
possibility that "the big picture is lost while the details are haplessly pursued." 

Finally, four teachers said the assessment would be best if used in conjunction with 
other methods of assessment. 
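
The percentages quoted for the appropriateness questions can be rechecked from the counts 
given in the text. The sketch below is illustrative only: the function name and the labels 
are assumptions introduced here, and the counts are those stated above (29 teachers, 11 
assessors). 

```python
def percent(count: int, total: int) -> int:
    """Return count/total as a percentage rounded to the nearest whole number."""
    return round(100 * count / total)


# Counts as stated in the text; the labels are hypothetical shorthand.
reported = {
    "teachers, general teaching skills": (22, 29),
    "teachers, laboratory science skills": (23, 29),
    "assessors, general teaching skills": (10, 11),
    "assessors, laboratory science skills": (9, 11),
}

for label, (count, total) in reported.items():
    print(f"{label}: {count} of {total} = {percent(count, total)}%")
```

Rounded this way, 10 of 11 comes to 91%, one point above the 90% given in the text; the 
other three figures match the reported values. 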

Assessment Format 

Although the classroom observation is a traditional method of teacher assessment, 
the Science Laboratory Assessment breaks new ground because it was designed to focus on a 
teacher's performance during a particular activity (i.e., a laboratory activity) in a particular 
subject area (i.e., science). Thus, whereas all classroom observation systems are relatively 
easy to administer because they require minimal materials (e.g., paper and pen for the 
assessor), the Science Laboratory Assessment may be more difficult to administer because it 
requires the observer to assess a teacher's performance while s/he is engaged in a specific 
activity that, by definition, is more student-centered than teacher-centered. Moreover, the 
Science Laboratory Assessment is not a checklist, but requires the observer to collect 
evidence during the observation by constantly writing down exactly what the observer sees 
during the lesson. In addition, the assessment requires the observer to categorize the 
evidence at the same time s/he is collecting it (i.e., writing it down). The assessment's 
analysis process also entails much more writing than traditional systems, and, perhaps, 
more careful codification. 

Still other format issues to consider are 1) the Science Laboratory Assessment, like 
all observation systems, cannot easily be administered to groups of teachers because its 
format requires one assessor observing one teacher at a time, and 2) the assessor must be 
able to travel whatever distance is necessary to observe at the teacher's school site (or 
science laboratory setting) because that is where the assessment takes place. As was 
mentioned in the March 1990 report, these format issues pose a formidable challenge in the 
state of California. 

In this section, the format issues which can be more easily addressed will be 
discussed. These include the clarity of the assessment's preparation materials for the 
teachers, the clarity of the pre- and post-observation conference questions, and the clarity of 
the documentation and rating forms and process used by the assessors. 






Clarity of the Teachers' Preparation Materials 



To prepare for the assessment, each participating teacher was asked to 1) read an 
orientation handbook which described the content (i.e., domains, elements, and indicators) 
and format of the assessment, and 2) complete the Pre-Observation Questionnaire which 
was found in the handbook. Although only 55% of the teachers (16 of 29) said they read the 
orientation handbook carefully, 93% (27 of 29) said that the handbook clearly described the 
aspects of teaching being measured by the assessment. Some teachers praised the handbook 
as "very good" and "very clear and understandable." 

When asked to offer suggestions for improvement, several teachers commented on 
the size of the handbook, describing it as "massive" or "too long," and suggested it be 
simplified. Only one teacher, however, offered a suggestion as to how it might be simplified, 
citing the existence of "several redundant sheets in the examples section." Other 
suggestions for improvement included 1) more warning as to the amount of paperwork (i.e., 
the Pre-Observation Questionnaire) they had to complete before the observation, and 2) 
more specific labeling of the "seven pages of terminology" (i.e., the domains, elements, and 
indicators) as the assessment content. 

Reviewing the orientation handbook sent to teachers, FWL staff did not find the 
materials to be massive or too long, but did feel that they could be better organized. A 
Table of Contents could be provided so as to alert the teacher to the specific contents of the 
handbook, and the seven pages describing the assessment content could be immediately 
preceded by a brief introduction which clearly states that the information which follows 
(i.e., the domains, elements, and indicators) describes those things which the observer will be 
looking for during the observation. In the examples section, there are three examples, each 
of which includes two pages (pgs. 7 and 8) that are exactly the same, and, therefore, could 
be considered redundant. FWL staff does not believe, however, that the elimination of these 
pages would improve the materials, because the pages are reference materials to be used in 
conjunction with some of the questions on the preceding page (i.e., p.8). 

The six-page Pre-Observation Questionnaire that the teachers had to complete 
consisted of a variety of questions about the class to be observed, as well as about the 
laboratory activity to be taught. In addition to the questions, the teacher is asked to fill out 
a chart describing the objectives, activities, student groups, materials and equipment, and 
safety issues for the lab activity. The questionnaire is then read by the assessor before the 
observation and the pre-observation conference. 




When the teachers were asked if they had any difficulties completing the 
questionnaire, 93% (27 of 29) said "no." Of the two teachers who said "yes," one claimed she 
did not receive the questionnaire, and the other stated she had difficulty because she was 
"unaware of the list of science themes" and that "this was information that I should have 
been aware of through my teacher education classes." (The list of science themes (e.g., 
energy, environment, stability, evolution) is included as part of the questionnaire to help the 
teacher answer question #14 which asks, "Which scientific theme(s) best pertains to your 
laboratory activity?") 

Although the majority of teachers said they did not experience any difficulties with 
the questionnaire, it should be noted that some of the assessors thought otherwise. An 
assessor who observed three teachers, two of whom had not completed the questionnaire 
when he arrived, commented that the teachers "have problems stating objectives if the lab is 
part of regular curriculum material." Another assessor remarked that at least three of the 
four teachers she observed seemed "overwhelmed" by the materials they were given, and 
that the questionnaire "seemed wordy, too lengthy and burdensome for these people to deal 
with." 

One other piece of evidence that suggests that the teachers may have had problems 
with the questionnaire even if they did not say so is the amount of time it took the teachers 
to complete the questionnaire. Although the majority of teachers who specified their time 
said they needed between 20 and 40 minutes to complete the questionnaire, one-fourth 
of the teachers (7) said they needed more than 40 minutes, and three of those teachers 
needed one hour or more. The longer time periods taken to complete the questionnaire may 
be another indication that some of the teachers had difficulty or were at least unfamiliar 
with some of the tasks on the questionnaire. 

Although FWL staff acknowledges that a six-page questionnaire could be considered 
lengthy, we feel that all of the information on the questionnaire is important for an 
observer to know before conducting an observation. In addition, even though the task of 
describing (on a chart) the objectives, activities, student groups, materials and equipment, 
and safety issues for the laboratory activity may be time-consuming, it is a task which, we 
believe, all science teachers should be able to complete in a competent manner. 






Clarity of the Conference Questions 



The Science Laboratory Assessment includes pre- and post-observation conferences 
conducted by the observer with the teacher. The pre-observation conference, which is 
conducted after the observer has read the teacher's questionnaire, consists of 15 questions, 
and the post-observation conference consists of 8 questions. The data gathered from both 
conferences is used in the analysis and rating process of the assessment. 

Both the teachers and the assessors were asked if they had any difficulties with the 
two conferences, and almost all of them said "no." Only two teachers and three assessors 
stated they had difficulty with either the Pre- or Post-Observation Conference. One of the 
two teachers expressed displeasure with not receiving any feedback after the Post- 
Observation Conference, and the other teacher complained of "too little time" allotted for 
the Post-Observation Conference. For the three assessors who had difficulty, lack of time 
was also an issue. One assessor commented: 

I found the pre-ob [conference] took a minimum of 30 
uninterrupted, on-task minutes. Most of my teachers only 
allowed 20 and were doing other things (e.g., supervising a 
class or writing up plans) at the same time. 

Although the developers of the assessment recommend 30 minutes for the pre- 
observation conference, the teachers were usually scheduled for a 20-minute conference. For 
some of the teachers, 30 "free" minutes were hard to find. For example, if the teacher was 
scheduled to be observed in the afternoon, the pre-observation conference usually took place 
during the teacher's lunch hour. Since many lunch hours are closer to 40 minutes than an 
hour, teachers were not asked to give up the majority of their lunch hour for the conference. 
Almost half (5) of the eleven assessors, however, indicated that, on average, they needed 30 
to 40 minutes to conduct the Pre-Observation Conference, and more than half (8) needed, 
on average, at least 20 minutes to conduct the Post-Observation Conference. 

When asked if there were any conference questions with which the teachers 
consistently had difficulty, approximately three-fourths of the assessors (8 of 11) said "no." 
Three of the assessors disagreed, naming pre-observation conference questions #4, #6, #8, 
and #9, and post-observation conference questions #2, #3, and #4 as sources of difficulty 
(see Appendix B). One assessor explained that #6, which in part asks, "What do students 
already know about this topic?", was difficult because the teachers "cannot say with accuracy 
what students already know about the topic." Another assessor commented that #8 and #9, 
which ask the teacher about future instruction, broad goals, and linkage between concepts, 
were difficult because the teachers he observed "generally indicated they didn't think about 
broader goals, the bigger picture, and connections." 

The assessors were also asked if they thought any of the conference questions could 
be eliminated or collapsed. Although four assessors said "no," and one assessor didn't 
respond, six of the assessors suggested changes to the pre-observation conference questions. 
Among these six assessors, there was some agreement as to the questions which should be 
changed, but there was not always agreement as to how they should be changed. For 
example, the question which received the most suggestions for change was #6 which reads, 
"What prior instruction have you implemented related to the lab activity? What do students 
already know about this topic?" The three assessors who targeted this question, however, 
each suggested a different change: 1) eliminate the question, 2) eliminate the second part of 
the question, and 3) collapse the question with #7 to form a new question. 

Altogether, eight of the 15 pre-observation conference questions were targeted for 
change by one or more of the assessors. Table 4.4 shows the questions recommended for 
change, the number of assessors who wanted to change the questions, and the changes 
suggested. FWL staff concur with the assessors who believe the assessment instrument 
would be improved if (1) changes are made to the pre-observation questions, and (2) the 
changes focus on improving the clarity and reducing the complexity of the 
questions. To this end, we recommend that some of the questions be collapsed (i.e., #6 and 
#7; #4 and #10), and that #9 should be made into two questions or the second part should be 
eliminated. We also recommend that if the pre-observation conference can be shortened by 
eliminating or collapsing questions, it should be. Careful review of all the answers given by 
the teachers to the pre-observation conference questions could provide useful insights as to 
how the questions can best be changed. 

Clarity of the Forms and Process for Documentation and Rating 

In order to document and rate a teacher's performance, the format of the Science 
Laboratory Assessment requires the assessor use three forms: 1) the Guided Note-taking 
Form, 2) the Documentation Sorting Record, and 3) the Summary Report Form. Because 
these forms were not provided to or used by the teachers, the teachers were not asked to 
evaluate them; hence, there is no teacher feedback included in the discussion of the forms. 






TABLE 4.4 

SUGGESTED CHANGES TO PRE-OBSERVATION CONFERENCE QUESTIONS: 
SCIENCE LABORATORY ASSESSMENT 

Each entry gives the question, the number of assessors wanting change, and the 
suggested changes. 

1. I have reviewed your Questionnaire. Is there anything on it you 
   need to change before we continue? 
   No. of assessors wanting change: 1 
   Suggested changes: eliminate 

4. Explain the scientific concepts and/or skills you are 
   teaching in this lab activity. 
   No. of assessors wanting change: 2 
   Suggested changes: collapse with #10 

5. What are some of the incorrect preconceptions that students 
   may have that relate to this activity? (pause) How do 
   you plan to address these during the lesson? 
   No. of assessors wanting change: 1 
   Suggested changes: collapse with #9 

6. What prior instruction have you implemented related to 
   the lab activity? (pause) What do students already know 
   about this topic? 
   No. of assessors wanting change: 3 
   Suggested changes: eliminate; eliminate 2nd part; collapse with #7 

7. Have you provided previous instruction to ensure that students 
   have the technical skills (e.g., students know how to use 
   a voltmeter) requisite to the successful completion of this 
   laboratory activity? If yes, was this provided recently? 
   If not, what techniques have you employed to provide you 
   evidence that students are ready to use the required processes 
   and technical skills? 
   No. of assessors wanting change: 2 
   Suggested changes: eliminate; collapse with #6 

8. What instruction are you planning to do in the future related 
   to this activity? 
   No. of assessors wanting change: 1 
   Suggested changes: collapse with #9 

9. What is the relationship or contribution of this laboratory 
   activity to the broad goals for the students' learning? (pause) 
   Does it provide linkage from one concept to the next, or is 
   part of a continuing direction within one major concept? If yes, 
   please explain. 
   No. of assessors wanting change: 2 
   Suggested changes: make into two questions; collapse into #5 and #8 

10. What advanced thinking skills (e.g., comparing, estimating, 
    inferring) will students be encouraged to use or required to 
    apply in order to productively participate in this activity? 
    No. of assessors wanting change: 2 
    Suggested changes: collapse with #4 



Guided note-taking form. As described at the beginning of this chapter, the Guided 
Note-taking Form (hereafter referred to as the GNF) is the form used by the assessor 
to record that which is seen and heard during the observation. The form is divided into 
seven spaces, each space corresponding to one of the assessment's seven domains. During 
the observation, the assessor is expected to simultaneously record and categorize the 
evidence/notes from the observation into the appropriate space (i.e., the appropriate 
domain). On average, an observer will record evidence and notes on 12-15 of these forms 
during a single observation. 

During the assessment training, the assessors expressed much frustration using the 
forms. Although many had experience scripting observations (i.e., writing down everything 
they saw/heard during an observation in a chronological manner), none had experience 
scripting and categorizing information at the same time. The trainers acknowledged the 
frustration, but instructed the assessors to continue using the forms in the hope that with 
practice the assessors would become more comfortable and proficient in using the forms. 

In fact, many of the assessors did become more comfortable and proficient using the 
form with practice, as indicated by assessor comments such as these: 

I was confused at times, but felt more comfortable the more I
did it.

It got easier to use the GNF with experience. 

Nevertheless, when asked if they had any difficulties with choosing the category in 
which to record evidence, 82% of the assessors (9 of 11) said "yes." For almost all of these 
assessors, the difficulties were a result of 1) not being clear on what the domains and 
elements meant, and 2) not being clear as to what to do with evidence that, in their opinion, 
fit into more than one domain. For example, in the comment below, an assessor explains 
how he was unclear about the meaning of elements within and across domains: 

In the Pedagogy domain I had difficulty distinguishing between
the elements, "Directions" and "Explanations/Presentations." I
also feel that Pedagogy's element, "Questioning," overlaps
Climate's "Inquiry" element.






Similarly, in the next comment, another assessor describes his difficulty choosing in 
which domain and elements to record evidence: 

I had problems when dialogue would fit into more than one 
category such as Climate's "Interactions with Students," and 
Pedagogy's "Monitoring and Adjusting" or "Feedback"....
Many times I would put the dialogue into both. 

Clearly, as was discussed in the section, "Assessors and Their Training," the 
assessors would have benefitted from more training in the meaning of the domains and 
elements, as well as what to do if evidence falls into more than one category on the Guided 
Note-taking Form. Indeed, because the training did not directly address this issue, nor was
it addressed at all in the Assessor's Handbook, it is hard to evaluate whether the difficulties
described above in using the Guided Note-taking Form are inherent to the form itself, are a
result of the training, or both.

In addition to the difficulties described above, approximately one third (4 of 11) of 
the assessors described difficulties that had little or no relationship to the training. One 
assessor, for example, emphatically expressed difficulty not with choosing the categories, but 
with trying to observe, write, and categorize the evidence simultaneously: 

It is very hard to move around the class and see the specific 
categories, and record exact quotes. 

Another assessor also expressed difficulty with the guided note-taking format, 
commenting that, because there is "no requirement to note what the teacher does not do, if 
the scripting is not complete, there is no way to know." In other words, because the GNF 
format requires the observer to categorize evidence as it is observed instead of scripting the 
entire lesson in complete chronological order, the GNF format misses the flow and 
continuity of the lesson and thus increases the chance that the observer may miss 
information about what the teacher has not done at a particular point in the lesson. 

Finally, other assessors found fault with the form itself, and offered suggestions, 
such as the following, as to how the form could be improved: reduce the size of the margin 
and the amount of information (e.g., title of form, slots for names, time, and date) at the 
top; and add to the top of the form a slot for the setting (e.g., whole-group, small-group). 






Should the Guided Note-taking Form be retained as part of the Science Laboratory 
Assessment, FWL staff agrees that the form would be improved by following the above 
assessors' suggestions.

Documentation sorting record. The seven-page Documentation Sorting Record (DSR) 
is the second step in the Science Laboratory Assessment's documentation and analysis 
process. After using the Guided Note-taking Forms to categorize the evidence/notes from 
the observation by domain, the observer uses the DSR to further sort the evidence/notes by 
element. The observer also uses the DSR to sort by element all of the information collected 
from the pre- and post-observation conferences and the questionnaire. As there are a total
of 33 elements, completing the DSR requires a considerable amount of work by the observer. 

As with the Guided Note-taking Forms, the assessors were asked to describe any 
difficulties they may have had using the Documentation Sorting Record. The major
difficulty, cited by all but two of the eleven assessors, was that the DSR process is
"time-consuming" and "very laborious." Assessors claimed the DSR took them between two
and five hours to complete. An assessor who completed four observations remarked,

Frankly, this part of the process I found an absolute "tear." It 
takes a long time, 2-4 hours, to get through the sorting record. 

One assessor pronounced the DSR to be the "weakest link" in the assessment process 
"because of length and consequence of time." 

To reduce the time, some of the assessors suggested changing the DSR process. A 
couple of assessors advocated omitting the DSR completely and just relying on the GNF. As 
one assessor explained, "I felt that my original notes on the GNF were legible and clear so 
that I could go directly to the Summary Report Form." Two other assessors suggested that 
the DSR be used more selectively, such as to note "only critical (+ or -) evidence" or to use 
the form only when there is a "potential 1 rating in any domain and a possible overall rating
of 1."

While FWL staff agrees that the DSR is a time-consuming process, we do not think
that there should be total reliance on the GNF because many assessors' notes are not legible
and clear during this step of the process. In fact, it is probably unrealistic, if not unfair, to
ask an assessor not only to script an observation, but also to categorize the evidence/notes
as they are recorded and to do all of this in a clear, legible manner. Assessors often use a
personal shorthand during this process which enables them to capture more evidence/notes.
If an assessor were required to always write in a clear and understandable manner at this
step of the process, there is a good chance that the assessor, while focusing on legibility, would
miss evidence.

For much the same reason, FWL staff believes that the DSR should not be used for
only some evidence or certain ratings. An independent reader should be able to read
evidence pertinent to all ratings (domains and "overall"), and this may not be possible if the
assessor's writing is not legible on the GNF. Furthermore, if an assessor only writes
"critical" evidence, there would have to be a clear understanding of the meaning of the word 
"critical," and this would likely add another subjective component (i.e., the assessor's 
judgement of what evidence is critical) to the assessment. 

Another difficulty, cited by several assessors (3 of 11) and which perhaps contributed
to the length of the DSR process, was that of deciding how and where to write the evidence.
As suggested by the following assessor, this difficulty may have partly been a result of the
training:

I had difficulty deciding what to put down and how to get it
down on paper. The concept is clear but I didn't feel the
training we had was complete enough to enable me to do this 
step comfortably. 

FWL staff agrees with this assessor that more training in sorting and writing up 
evidence was needed, and believes that such training would greatly reduce or eliminate the 
difficulty described above. 

FWL staff further believes that consideration should be given to revising the 
Documentation Sorting Record so that it is more than just a place to recopy evidence/notes 
and to read all the evidence/notes together. It does not seem worth two hours of an 
assessor's time to basically recopy notes. The DSR should be revised to somewhat resemble 
the second step of the Classroom Competency Instrument, an assessment instrument pilot 
tested last year. That is, the second step in the documentation process would require the 
assessor to not only sort evidence by element, but to also differentiate whether the evidence 
was positive or negative. Furthermore, the way in which the evidence is written could be
specified so that 1) the assessor does not have to write all evidence but only that which best
exemplifies (positive and negative) the elements, and 2) there is some consistency among 
assessors' write-ups. Such revisions, we believe, would greatly enhance the assessment 
process as a whole and the DSR in particular. 

Summary report form. The Summary Report Form constitutes the third and final
step in the Science Laboratory Assessment's analysis and rating process. On this two-page
form, the assessor records a rating of the teacher's performance for each domain and
assigns an overall rating to the teacher's performance. The assessors are asked to choose
between two possible ratings: a "2" rating indicating minimal competency, and a "1" rating
indicating a lack of minimal competency. If a choice between the two ratings cannot be
made because of a borderline performance or a lack of sufficient information, then the
assessor gives an "X" rating. After making the rating, the assessor writes three or four
summary remarks corresponding to that domain and rating (or the overall rating).

The assessors were asked if they had any difficulty with 1) recording summary 
remarks for each domain, 2) assigning a rating for each domain, and 3) assigning an overall
rating for the teacher. An overwhelming majority said "no" to each of the above. In fact, 
only one of the eleven assessors expressed any difficulty recording summary remarks, only 
one had difficulty assigning a rating for each domain, and no assessor had difficulty 
assigning an overall rating. Of the assessors who did experience difficulty, one was not 
clear on the difference between a minimally acceptable and not acceptable performance; the 
other was not clear on how to write the summary remarks. The latter explained: 

My concern was that I was missing something important that 
should have been picked up or that I was somehow mishandling 
the evidence.

This assessor added that it would have been helpful during training to have had several 
examples of how different observers write summary remarks. 

The assessors were also asked to suggest which domains, if any, should receive
more/less weight when considering the overall rating of the teacher. Of the eleven
assessors, three felt none of the areas should receive more/less weight, and two assessors did
not respond to the question. The answers of the remaining six assessors are listed below,
together with the number of assessors who gave them.



Most Emphasis                             Least Emphasis

Pedagogy               (4)                Knowledge of Students  (3)
Content                (4)                Climate                (2)
Management             (4)                Communication          (2)
Materials/Equipment    (3)                Materials/Equipment    (1)
Climate                (2)                Management             (1)
Communication          (illegible)        Pedagogy               (0)
Knowledge of Students  (0)                Content                (0)



Some of the assessors explained why they thought certain domains should receive 
less emphasis. For example, one assessor thought the Knowledge of Students and Climate
domains should receive less emphasis because "they are really included in Pedagogy, 
Management and Communication." This thought was echoed by another assessor who 
suggested the Knowledge of Students domain receive less emphasis because "beginning 
teachers have difficulty sorting this out from their overall pedagogy and management." One 
assessor proposed that the Management and Communication domains receive less emphasis 
because "they develop after your first year of teaching." Finally, the assessor who 
nominated the Materials/Equipment domain for less emphasis did so because the "safety 
aspect of a teacher's handling of materials cannot be determined in some observation 
activities." 

FWL staff's analysis of the assessors' Summary Report Forms raised several
concerns, the majority of which revolved around the assessors' summary remarks made for
each domain. Focusing on the Summary Report Forms for the five teachers who were
double-scored, we found tremendous differences in the summary remarks written by the
assessors. For example, for the high school biology teacher who instructed a lab on proteins
and oils in the digestive system, the two assessors who observed him each gave him a "2"
rating in the Materials/Equipment domain. Their summary remarks corresponding to this
rating, however, were vastly different. One assessor wrote:

A good deal of work done to organize materials for the lab.

The other assessor's remarks were as follows:

The teacher provides verbal safety instructions and points out 
location of safety and clean-up items. The teacher monitors 
student use of chemicals during the lab and has prepared and 
allocated materials so as to save time during the lab. Students 
might be given more responsibility in labelling tubes for the lab 
thereby decreasing prep time for teacher. All students assist in 
an orderly clean-up. 

Although both assessors agree that the teacher is at least minimally competent (i.e.,
merits a "2" rating) in this domain, we get two different pictures of the teacher's competency
when we read the two assessors' summary remarks. The first assessor's remark gives
limited information about the teacher's performance in this domain, and equally limited
information to support the "2" rating. What does the assessor mean by "a good deal of
work"? What about the teacher's performance in the other elements of the domain? For
example, did the teacher set the materials up safely? Did the teacher and students use the
materials safely? Were the materials available to all students? The first assessor's summary remark
does not really support or explain why the teacher received a "2" rating. The second 
assessor's comments, however, give a much richer description of the teacher's performance 
and competency, and offer support and explanation of why the teacher received a "2" rating. 
Although the second assessor could be faulted for including a suggestion about how the 
teacher could reduce prep time, the second assessor's remarks seem preferable to those of 
the first. 

To further illustrate differences in summary remark write-ups, the following are two 
assessors' summary remarks addressing a Kindergarten teacher's performance in the 
Content domain (both assessors gave the teacher a "2" rating): 






First assessor: 

Teacher does this lab as part of AIMS unit on aeronautics. It 
shows that air is energy and can be used to do work. Good lab 
to show that concept at this level. Integration with other 
units/subjects was weak. 

Second assessor: 

I noted that the teacher was able to relate easily to previous 
and future direction "Scientists today," "role playing yesterday,"
"how far can you," and "lift."

Of the first assessor's four summary remarks, only one directly addresses any of the 
three elements (i.e., Integrated) of the Content domain. The other three remarks are more 
descriptive of the lab than of the teacher's performance. The second assessor offers only 
one remark, also addressing only one of the domain's elements (Integrated). While this 
assessor's remark is not especially clear, it does include verbatim examples in support of the 
assessor's remark. 

Thus, while the assessors expressed no difficulty with writing the summary remarks 
on the Summary Report Form, there is very little consistency among assessors as to what is 
written and how it is written. Furthermore, more often than not, the summary remarks do 
not seem to summarize the teacher's performance or adequately support/explain the rating 
given. As presently constructed, the RMC rating process is essentially a pass/fail system 
which does not provide information that differentiates among teachers who are at different 
levels of performance within the domain. Thus, without major revisions, the state could not 
use this instrument to increase teachers' competencies because there is no basis upon which
to do so. 

FWL staff suggests that the rating process of this assessment undergo extensive
revision so that (1) there is a clearer picture of what a "2" performance looks like, and (2)
the instrument can be used to increase teachers' competencies (e.g., through staff
development). One possibility for revision is suggested by the state's 1990 Science
Framework. Taking the state's criteria for adoption of instructional materials as a basis,
the RMC assessment's rating process might be revised to include at least three domains,
each of which is weighted (which may be done with points). The first domain, for example,
could be Content and it would carry the most weight (or points). Within this domain, there
would be at least four elements which cover accuracy, depth, use of themes, and process. 
Each of the four elements could also be weighted (e.g., accuracy would be more important 
than depth). The other two domains could be Presentation and Pedagogy, both of which 
might carry the same weight. Presentation would include elements which address, for 
example, communication, attitudes toward science, explanation/presentation, and 
questioning. Pedagogy would include elements which might address grouping, feedback, 
student engagement, and knowledge of students. The Science Laboratory Assessment's 
domain of Materials/Equipment might serve as a fourth domain, or it could be included in 
the other domains (e.g., in the Content domain under the accuracy element-materials and 
equipment are handled in a correct manner by teacher and students). Many of the 
assessment's other elements could also be subsumed under the above domains. 
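
The weighting idea described above can be made concrete with a short sketch in Python. The domain and element names are taken from the paragraph above, but every point value, both function names, and the met/not-met scoring of each element are hypothetical; the report does not specify any of them.

```python
# Hypothetical point values. The text proposes only that Content carry the most
# weight, that Presentation and Pedagogy carry equal weight, and that accuracy
# count for more than depth; the numbers below are illustrative assumptions.
ELEMENT_POINTS = {
    "Content": {"accuracy": 14, "depth": 10, "use of themes": 10, "process": 10},
    "Presentation": {
        "communication": 7,
        "attitudes toward science": 7,
        "explanation/presentation": 7,
        "questioning": 7,
    },
    "Pedagogy": {
        "grouping": 7,
        "feedback": 7,
        "student engagement": 7,
        "knowledge of students": 7,
    },
}


def domain_weight(domain):
    """Total points available in a domain (the domain's weight)."""
    return sum(ELEMENT_POINTS[domain].values())


def weighted_score(elements_met):
    """Sum the points for every element the assessor judged as demonstrated.

    elements_met maps a domain name to the set of element names observed.
    """
    return sum(
        points
        for domain, elements in ELEMENT_POINTS.items()
        for element, points in elements.items()
        if element in elements_met.get(domain, set())
    )


# Two hypothetical teachers who might both receive a "2" under a pass/fail rule.
all_elements = {d: set(e) for d, e in ELEMENT_POINTS.items()}
missing_depth = {d: set(e) - {"depth"} for d, e in ELEMENT_POINTS.items()}

print(domain_weight("Content"), domain_weight("Pedagogy"))  # 44 28
print(weighted_score(all_elements), weighted_score(missing_depth))  # 100 90
```

Under these assumed points, two performances that a pass/fail rule would treat identically receive different totals, which is the kind of differentiation the proposed revision is meant to provide.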

With such a rating process as described above, the assessor would be firmly guided in 
making his/her rating decision and the rating results would more likely show greater 
differences among teachers' performances. While we cannot advocate such a process
without further and extensive study, we do recommend that strong consideration be given to 
reducing the number of domains for which ratings are given, to weighting the domains 
(especially the Content domain), and possibly to weighting the elements. Whatever 
revisions are made or considered, the end goal should be to produce an assessment in which 
a "2" rating, for example, is meaningful and consistent across assessors and teachers. 

Cost Analysis 

Based on our experience pilot testing both this version of the Science Laboratory 
Assessment and the Connecticut Competency Instrument (CCI) in 1989, we have outlined in
this section estimates for administering and scoring this laboratory observation assessment 
and summarized costs for the development and pilot testing of this prototype. These costs, 
however, should be taken as only preliminary estimates for what costs would be incurred if 
an assessment like this were to be further developed and modified for implementation on a 
wide scale. 

Administration and Scoring Cost Estimates 

Assessor time and costs. Administering this assessment requires a trained observer-assessor
to (a) prepare and arrange for the assessment, (b) review the pre-observation
questionnaire form, (c) conduct the pre-observation conference, (d) conduct a 30- to 45-minute
observation, (e) conduct a post-observation conference, and (f) summarize the evidence/notes
taken during the observation and rate the teacher's performance. These activities take
approximately 4-6 hours for each observation. Thus, an hourly rate of $20 implies a cost of
$80 to $120, or approximately $100, per observation to conduct this assessment.

Training costs for assessors. The training for this assessment consisted of one home-study
day and two days of group training. As related earlier, we do not believe this is
sufficient time to train assessors to reliably and validly score this assessment. At a
minimum, the training should be extended by one day, and we believe that it will ultimately
need to be closer to the five days used for the CCI training, which also has a two-day follow-up
session. For estimates here, we will assume that the training will take four days with no
follow-up training needed. If each assessor-observer conducts 30 observations each year for 
five years, we can distribute the training costs over 150 observations. Reimbursing the 
assessors for the four days of training at $20 per hour would add about $4 to the cost of 
each assessment. 

Other costs. Other costs include those associated with the telephone, duplication, 
postage, and travel. Travel could be expensive in California unless regional assessors were 
used. Estimating costs of these activities or ingredients would depend in large part on the 
manner in which the system was ultimately designed and how costs were apportioned. 
Using a figure of $30 per assessment for these activities would assume minimal travel costs, 
based on our experience from pilot testing. 

The above estimates imply that the costs for administering and scoring each 
assessment could be approximately $134. This figure should be taken as only an initial and 
rough estimate. Actual costs would depend largely on the rates and methods for paying 
assessors, whether the assessments were administered with local or centrally based 
assessors, and the degree to which the training and administration times for the final 
assessment were within the estimates used here. 
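
The arithmetic behind the $134 figure can be retraced with a minimal sketch. The rates, hours, and observation counts are those given above; the eight-hour training day and the use of the midpoint of the 4-6 hour range are assumptions made here to reproduce the rounded figures, since the report states neither.

```python
HOURLY_RATE = 20                  # dollars per hour (from the text)
HOURS_PER_OBSERVATION = (4, 6)    # approximate range (from the text)
TRAINING_DAYS = 4                 # training length assumed in the text
HOURS_PER_TRAINING_DAY = 8        # assumption: not stated in the report
OBSERVATIONS_PER_YEAR = 30        # from the text
YEARS_AS_ASSESSOR = 5             # from the text
OTHER_COSTS = 30                  # dollars per assessment (from the text)


def cost_per_assessment():
    """Return (assessor cost, training share, other costs) in dollars."""
    low, high = HOURS_PER_OBSERVATION
    assessor_cost = HOURLY_RATE * (low + high) / 2  # midpoint of the range
    training_total = TRAINING_DAYS * HOURS_PER_TRAINING_DAY * HOURLY_RATE
    observations = OBSERVATIONS_PER_YEAR * YEARS_AS_ASSESSOR
    return assessor_cost, training_total / observations, OTHER_COSTS


assessor, training, other = cost_per_assessment()
print(round(assessor), round(training), round(assessor + training + other))  # 100 4 134
```

Changing any one constant (for example, a five-day training or fewer observations per assessor) shows how sensitive the per-assessment figure is to the design choices discussed above.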

Development and Pilot Testing Costs 

Although the development and pilot testing of the Science Laboratory Assessment 
was much closer to a research and development stage than an implementation stage, it still 
may be helpful to report the costs associated with developing and pilot testing this 
prototype. 






Development of this prototype, as described earlier, drew substantially from the 
experience of other teacher performance assessment systems such as the CCI. Thus, the 
development for this assessment benefitted from the prior development and materials 
available from these earlier efforts. Costs for Development and for Pilot Testing are 
outlined in Table 4.5 in terms of developer and pilot test staff time, consultants to the
developers and in the pilot testing (e.g., consultant costs in the pilot testing include
reimbursing the teachers and assessors participating in the pilot test), travel, and other
direct costs for items such as phone, duplication, and facilities.

Cost Summary 

The experience of pilot testing a limited number of Science Laboratory
Assessments provides some initial estimates of the costs that might be expected in developing and
implementing an assessment of this type. The development and pilot testing costs could be
reduced with a larger scale and more advanced stage of development. Similarly, the costs
for administering and scoring will depend on the number of teachers to be assessed, the
location and costs associated with training and supporting assessors, and the methods used
to pay these costs. For example, using retired teachers as assessors, rather than practicing
teachers for whom substitute teachers must be provided, would likely result in different costs.

Technical Quality 

This section briefly discusses three technical issues related to the Science Laboratory
Assessment: development, reliability, and validity.

Development 

Development of the Science Laboratory Assessment began in 1989 in response to a 
request for proposals from the California SDE/CTC. Several sources of information were
utilized in developing the assessment materials and procedures to be pilot tested in the 
spring of 1990, including reviews of literature on effective science teaching, other teacher 
performance assessment systems, textbooks on science teaching methodology, California's 
curriculum guides and framework for science, and California's standards for beginning 
teachers. Over 100 California science educators were involved in the development of the 
assessment, either as members of the Assessment Development Committee or as reviewers 
of the assessment materials. 






TABLE 4.5

DEVELOPMENTAL AND PILOT TEST COSTS FOR THE
SCIENCE LABORATORY ASSESSMENT

Cost Categories                          Development    Pilot Testing

Staff Salaries & Benefits                    $41,972          $ 9,869
Consultants (Teachers, assessors,
  and other consultants)                           0            9,515
Travel (Consultants and Staff)                     0            3,664
Other Direct Costs (Site rental,
  phone, duplication)                          7,790            2,348
Total Direct Costs                           $49,762          $25,396
Indirect Costs                                 9,903            6,928
Total Costs                                  $59,665          $32,324






Although it was not possible for all Committee members to conduct tryout 
administrations of the assessment materials, several members did some type of activity 
related to tryout administrations. Only two committee members, however, completed a full
assessment with a new teacher. Other members asked new teachers at their school to try 
out some part of the assessment (e.g., the questionnaire) or used the materials for self- 
assessment. As a result of these efforts, several modifications were made to the content and 
format of the materials. 

Concurrent with the tryout administrations, a statewide review of the assessment 
materials was conducted by 63 science educators and scientists throughout California. 
Reviewers were asked to comment on several aspects of the assessment including the job 
necessity and appropriateness for new teachers of the domains and elements covered by the 
assessment. The developers reported that, overall, the reviewers seemed very positive about
the materials and felt that the elements were necessary for effective teaching of a science
laboratory activity and were appropriate to expect of a new teacher.

Reliability 

The data reported in Figure 4.3 summarized the performance of the teacher
candidates on this assessment. Since nearly all teachers passed most or all parts of the
assessment, no further analysis was done to estimate the reliability of the assessment. The
five instances in which two observers observed a teacher resulted in both observers rating
the teacher as passing.

The pilot testing does not provide sufficient information upon which to judge the 
reliability of the assessment. It is not possible to determine at this point whether the
teachers' performance reflects (a) the degree to which all or nearly all teachers possess the
skills reflected in the assessment, (b) the absence or unclarity of criteria for rating teachers,
which resulted in assessors assigning passing scores and being reluctant to assign a failing
score in the absence of more definitive criteria, or (c) a need to build in greater range in 
rating which would allow assessors to better discriminate among teachers with differing 
levels of competence in those skills measured. 

Two factors mentioned earlier can improve the scoring and information from this
prototype assessment. First, more explicit criteria for scoring, with supporting examples,
need to be developed and incorporated into the training. This includes having assessors
better summarize their observations by listing those factors which support and which
indicate the deficiency of teachers on each of the factors. Second, consideration should be
given to expanding the range of ratings so as to avoid the "ceiling" effect observed here, in
which all teachers were rated similarly (i.e., as passing).
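
The following sketch illustrates why uniform passing ratings give no basis for estimating reliability. It uses percent agreement and Cohen's kappa, a standard chance-corrected agreement statistic that is introduced here for illustration only and was not used in the pilot analysis. The five double-scored teachers are encoded from the Reliability discussion above as pass/pass pairs; the second data set and the function names are hypothetical.

```python
from collections import Counter


def percent_agreement(pairs):
    """Share of teachers on whom the two assessors gave the same rating."""
    return sum(a == b for a, b in pairs) / len(pairs)


def cohens_kappa(pairs):
    """Chance-corrected agreement; None when chance agreement is already 1."""
    n = len(pairs)
    observed = percent_agreement(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    categories = first.keys() | second.keys()
    expected = sum(first[c] * second[c] for c in categories) / (n * n)
    if expected == 1:
        return None  # no variation in ratings, so kappa is undefined
    return (observed - expected) / (1 - expected)


# The five double-scored teachers: both assessors rated each as passing.
pilot_pairs = [("pass", "pass")] * 5

# Hypothetical ratings with some variation, for contrast only.
varied_pairs = [("pass", "pass"), ("pass", "fail"), ("fail", "fail"), ("pass", "pass")]

print(percent_agreement(pilot_pairs), cohens_kappa(pilot_pairs))  # 1.0 None
print(percent_agreement(varied_pairs), cohens_kappa(varied_pairs))  # 0.75 0.5
```

With every rating in the same category, agreement is perfect but cannot be distinguished from what chance alone would produce, which is one way to see the need for a wider range of ratings.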

Validity 

The above section which describes development and background of this assessment 
provides information on the developers' involvement of science teachers and experts in the 
development of the prototype. This involvement contributed to the assessment's alignment 
with the curriculum frameworks and teaching standards which has been described. Thus, 
this information supports the content validity and focus of the assessment on important and 
current approaches to teaching science in laboratory settings. 

Revisions to the scoring criteria and training may result in an assessment which 
better differentiates among teachers who are likely to have different degrees of skills in the 
areas examined by the prototype assessment. However, the pilot test only yielded
information sufficient to say that the new teachers who participated in this pilot test were 
acceptable on the criteria as currently constituted. 

Conclusions and Recommendations 

This section contains conclusions and recommendations regarding the Science 
Laboratory Assessment, organized into the areas of administration, content, format and a
brief summary. 

Administration of Assessment 

As is often the case with high-inference observation instruments, the administration 
of the Science Laboratory Assessment is labor intensive, requiring nearly one professional 
person day per teacher. For this pilot test, each of the 11 experienced science teachers who
served as assessors agreed to conduct a minimum of three observations (i.e., take three days 
off from their teaching job or other work) during a six-week period. Few assessors were 
willing to leave their classrooms for more than the three days because of the difficulties they 
experienced trying to combine the administration of the assessments with the execution of 
their teaching duties. Therefore, should an observation system such as the Science 
Laboratory Assessment be considered for credentialing use in the state, we recommend the 
following: 




• careful design of observation schedules for assessors to allow 
assessments to be distributed in a reasonable manner. 

• consider expanding the recruitment pool of possible assessors to retired 
science teachers, science teacher supervisors, teacher trainers and 
others in addition to practicing teachers. 

In addition to the above, the following factors seem to be key to smooth 
administration of the Science Laboratory Assessment in its present form: 

• recruiting assessors who have expertise in more than one area of 
science (e.g., chemistry and physics) and/or experience teaching at or 
with different grade levels (e.g., high school and middle school) so as 
to allow more flexibility in the scheduling of observations; 

• development of procedures for obtaining completed assessment 
materials from assessors in the field; and

• arrangements for storage of a large amount (at least 25 pages) of 
documentation per teacher. 

Finally, since the Science Laboratory Assessment is administered and scored by the 
same person, the training of assessors is also a key factor to successful administration of the 
assessment. Through training, assessor candidates are taught the content of the 
assessment, as well as how to conduct and score the assessment. For this pilot test, 
training consisted of one home study day and two days of group training. However, based 
on assessors' comments, FWL staff's observation of the training, and FWL staff's review of
the assessors' completed documentation forms, three days of training does not appear to be
sufficient. Any future training might incorporate the following recommendations: 

• increase group training time to no less than three days and possibly to 
five days; 

• increase the training time allotted for introduction to, or review of, the 
assessment's content so that all of the participants agree on the 
definitions of the elements; 






• include in the training more explicit instruction and examples on 
recording evidence and notes on each of the assessment forms, as well 
as on evaluating teacher performances; and 

• increase the training time allotted for practice using the different forms 
to score teacher candidates and discussion of the results of this practice. 

Following the above suggestions should greatly facilitate the administration of the 
assessment. 

Assessment Content 

Based on the observations of FWL staff, as well as information collected from 
assessors, teachers, and the assessment documentation (e.g., rating forms), the following 
conclusions are offered about the content of the Science Laboratory Assessment: 

• Congruence of the Science Laboratory Assessment with the 1990 
California Science Framework, Kindergarten through Grade Twelve, 
can best be described as partial. One way to strengthen the congruence 
would be to weave the idea of science themes, a major emphasis of the 
framework, throughout more of the assessment (e.g., include them as part of 
the elements and conference questions whenever possible). 

• Coverage by the Science Laboratory Assessment of the California 
Standards for Beginning Teachers is relatively good. Coverage is 
particularly good for those standards which focus on student rapport 
and classroom environment, diverse and appropriate teaching, student 
motivation and conduct, presentation skills, and cognitive and affective 
outcomes of teaching. Coverage is partial for those standards 
addressing curricular and instructional planning skills, student 
diagnosis, achievement and evaluation, and a teacher's capacity to teach 
crossculturally. 

• The job-relatedness of the Science Laboratory Assessment seems to be 
high because the assessment entails observing teachers actually teaching 
in their own classrooms. 






• Overall, the content of the Science Laboratory Assessment does not 
seem too difficult for beginning teachers. Approximately 93% (27 of 29) 
of the pilot test participants received overall passing scores (i.e., received 
an overall rating of "2"). Furthermore, at least 84% of the teachers 
passed each one of the assessment's seven domains. 

• Analysis of the rating results by grade level (i.e., elementary, middle 
school, and high school) indicates that elementary teachers of science 
(i.e., those who have been trained to teach science at the elementary 
level) did as well or better on the assessments as did middle school and 
high school science teachers. Thus, the assessment seems an 
appropriate one for teachers of science at all grade levels. 

• Acknowledging the increasing diversity in California's classrooms, the 
developers of the Science Laboratory Assessment included in the 
content of their assessment a domain specifically targeted to assessing a 
new teacher's ability to work with diverse students. This domain, 
however, was named by almost half the assessors as the hardest domain 
to rate because it depends as much on the assessor's knowledge of 
students in the classroom as on the teacher's knowledge. Thus, it is 
questionable whether the domain, as it is currently written, is an 
effective way of assessing a beginning teacher's ability to work with 
diverse students. 

• The assessment was deemed by the teachers and assessors to be fair 
across groups of teachers (e.g., different ethnic groups, different 
language groups). However, as one teacher pointed out, true fairness 
depends on assessors' awareness of different teaching styles, especially 
with regard to management, climate, and communication. Without such 
awareness, the likelihood increases that the assessor may misinterpret 
or miss pertinent evidence. 

• The majority of teachers and assessors think the Science Laboratory 
Assessment is an appropriate way of assessing (1) general teaching 
skills, and (2) skills in teaching laboratory science. 






Assessment Format 



One strength of the format of the Science Laboratory Assessment is that its focus is 
not on a simulated performance, or on how a teacher says s/he would perform, or on a 
teacher's knowledge of how to perform, but rather on a teacher's actual performance in the 
classroom. In addition, because the teacher is observed in his/her own classroom, no special 
facilities are required for administration. 

Another strength of the format is that it actually includes two methods of 
assessment: observation and interview. The pre- and post-observation conferences which 
are part of the assessment are designed to (1) help the assessor understand the instructional 
goals and classroom context which affect the lesson design, and (2) give the teacher an 
opportunity to explain and justify changes in the original lesson design in response to 
unanticipated circumstances, as well as to reflect upon the lesson as it was conducted. The 
information provided in the two interviews and through the Pre-Observation Questionnaire 
(which is completed by the teacher before the observation) allows the assessor to 
conditionally evaluate teacher behaviors in light of differing instructional goals and 
classroom contexts. This type of observation instrument is superior to others used in 
teacher assessment because it focuses on the meaning rather than frequency of teacher 
behaviors. 

Despite the above strengths, comments from the assessors and an analysis of the 
completed documentation and rating forms indicate that the format of the Science 
Laboratory Assessment could be improved in several ways. We suggest that consideration 
be given to following these recommendations: 

• Shorten the Pre-Observation Conference by eliminating or 
collapsing some of its 15 questions. Changes should focus especially 
on improving the clarity and reducing the complexity of the 
questions. 

• Because 82% (9 of 11) of the assessors said the Guided Note-taking 
Form (the form used by the assessors to simultaneously record and 
categorize the evidence/notes from the observation) was a source of 
difficulty, either training should be designed to specifically focus on the 
problems experienced by the assessors or the form should be greatly 
revised (and training should be designed to cover the revisions). 




• The Documentation Scoring Record should be revised so that the two to 
four hours it takes to complete the form results in more than just a 
recopying of the evidence/notes from the various assessment 
components (e.g., observation, conferences). Perhaps evidence could be 
sorted not only by element, but also by whether it is positive or 
negative. Furthermore, not all evidence would have to be included, but 
only that which best exemplifies (positively and negatively) the 
elements. 

• Although almost none of the assessors had difficulty rating the 
teachers' performances on the Summary Report Form, there was so 
little consistency in how the assessors wrote their summary remarks to 
support their ratings that this process needs substantial improvement. 
Any future training should include sufficient instruction as to how to 
write the summary remarks so that they (1) summarize the teacher's 
performance, and (2) adequately support/explain the assessor's rating. 

• Because the Science Laboratory Assessment provides a wealth of 
information about the teacher's performance, the assessment's rating 
process should be revised so that this information is better utilized. 
The rating process might be revised from what is now essentially a 
pass/fail system to one which differentiates among teachers who are at 
different levels of performance. In this way, the instrument could also 
be used to increase teachers' competencies (e.g., through staff 
development) by drawing on the evidence/notes from the various 
assessment sources (e.g., observation, conferences). 

Summary 

If an observation system such as the Science Laboratory Assessment is selected as a 
method of assessing new teachers of science (or of other subjects) for credentialing purposes, 
the Science Laboratory Assessment could serve as a base upon which to build a fully 
developed prototype, but only after substantial revisions have been made to its 
documentation and rating processes. 









CHAPTER 5: 



ASSESSMENT OF COMPETENCE IN MONITORING STUDENT ACHIEVEMENT 

IN THE CLASSROOM 



The Assessment of Competence in Monitoring Student Achievement in the 
Classroom, designed by the Northwest Regional Educational Laboratory (NWREL), consists 
of a set of ten exercises to which the teachers respond in writing. The assessment is built 
around a staff development component that provides teachers with instruction on measuring 
classroom achievement. This instruction had been previously developed by NWREL as part 
of a decade-long analysis of the task demands of classroom assessment 
(Stiggins, Conklin and Associates, in press). In the pilot test, pre- and posttests 
were given both to a set of teachers who participated in the staff development activities and 
to another set of teachers who did not. Two parallel forms of the instrument were 
developed. The two forms were distributed evenly among the treatment and control 
teachers for the pretest. For the posttest, each teacher was given the form which they had 
not taken in the pretest. 
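The counterbalanced use of the two forms described above can be sketched in a few lines of Python. This is only an illustration, not the procedure FWL actually followed: the function names and teacher identifiers are hypothetical, and simple alternation is assumed as a stand-in for whatever method produced the even split. Only the group sizes (33 and 17) come from the pilot test.

```python
from itertools import cycle

def assign_pretest_forms(teacher_ids):
    """Alternate Form A and Form B so the split is as even as possible."""
    return {tid: form for tid, form in zip(teacher_ids, cycle("AB"))}

def posttest_form(pretest_form):
    """Each teacher takes the form not taken at the pretest."""
    return "B" if pretest_form == "A" else "A"

# Hypothetical identifiers; group sizes are those reported for the pilot test.
treatment = [f"T{i:02d}" for i in range(1, 34)]   # staff development group
control = [f"C{i:02d}" for i in range(1, 18)]     # comparison group

pre = {**assign_pretest_forms(treatment), **assign_pretest_forms(control)}
post = {tid: posttest_form(form) for tid, form in pre.items()}
```

Because every teacher switches forms, any difference in difficulty between the two forms is spread across both groups rather than being confounded with the pretest-to-posttest change.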

Each form consists of ten exercises, each of which uses a brief paragraph to describe a 
specific situation related to the day-to-day monitoring of student achievement in the 
classroom. Some exercises call for the construction of a particular form of assessment, such 
as a few items in a paper-and-pencil quiz or a structured observation plan. Others ask the 
teachers to describe a course of action they would recommend to solve the assessment 
problem presented. Still others ask for the expression and defense of an opinion about a 
day-to-day classroom assessment issue. 

The assessment developer identified the following six dimensions of competence in 
the monitoring of student achievement as the focus of the assessment: 

• Understanding of and ability to carry out the full range of uses of classroom 
assessment 

• Understanding of achievement targets for students and the ability to translate 
those into appropriate assessment methods 






• Ability to judge and maximize the quality of soundness of assessments 



• Understanding of and ability to use the full range of tools available for classroom 
assessment 

• Understanding of the role of assessment as a dynamic interpersonal activity 

• Ability to transform assessment results into sound feedback on performance 

The staff development consisted of six three-hour sessions occurring after school, 
arranged on two consecutive days in each of three months. The topics covered in the staff 
development, in the order presented, were: 

1) Understanding the meaning and importance of high-quality classroom 
assessment 

2) Measuring thinking skills in the classroom 

3) Constructing paper and pencil assessments for classroom use 

4) Using observation and judgement in classroom assessment 

5) Understanding standardized tests 

6) Developing sound grading practices 

Each topic is addressed by one or more exercises in the assessment. 

The assessment is scored through a comparison of the teacher responses to a 
predetermined set of correct answers. For some exercises, partial points are available for 
responses which exhibit some, but not all, characteristics of a response deemed to be 
complete. For other exercises, teachers need to provide only some of the possible responses, 
e.g., one positive feature of an assessment approach when four specific features are listed as 
correct. 
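The second kind of exercise, in which a teacher needs to name only some of the listed features, can be illustrated with a small Python sketch. Everything here is hypothetical: the function name, the feature labels, and the point values are placeholders, and exact set matching stands in for the judgment a human scorer exercises when deciding whether a free-text response corresponds to a listed answer.

```python
def score_listed_features(response_features, accepted_features,
                          required=1, points_each=1):
    """Award credit for accepted features named, capped at the number required."""
    matched = set(response_features) & set(accepted_features)
    return min(len(matched), required) * points_each

# Hypothetical rubric: four accepted features, only one needed for full credit.
accepted = {"feature 1", "feature 2", "feature 3", "feature 4"}

print(score_listed_features({"feature 3"}, accepted))               # 1
print(score_listed_features({"feature 1", "feature 4"}, accepted))  # 1
print(score_listed_features({"something else"}, accepted))          # 0
```

Capping the count at `required` reflects the idea that naming additional correct features earns no extra credit once the exercise's requirement is met.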






Administration of Assessments 



This section on administration of the assessment contains an overview of the 
assessment administration, a description of the required logistics, a discussion of security 
needs, a description of the assessors and their training, and a brief description of teacher 
impressions of the administration. 

Overview 

The administration of the assessment occurred in Northern California. Participating 
teachers came from elementary schools in two districts. The pretests were given after 
school on either March 22 or March 26, 1990. Staff development activities were then 
conducted. Most posttests were given after school on either May 22 or May 23, 1990. Seven 
teachers could not come on the scheduled date for the posttest, so the assessment was 
administered to them on other dates by district administrators. 

As can be seen in Table 5.1, a total of 50 teachers participated in the pilot test, with 
33 in the group participating in staff development and 17 in the group which did not. 
Forty-six teachers completed the posttest. The teachers were distributed almost evenly 
across the two forms of the assessment. 

Table 5.2 shows the characteristics of the teachers in the sample. The majority of 
the teachers in both the staff development group and the non-staff development group were 
women. There were four minority teachers in the group receiving staff development and 
two in the other group. Of the 39 teachers indicating which grade they taught, most taught 
in the intermediate grades (grades 3-6), although the teachers not participating in staff 
development were almost evenly split between the primary (K-3) and intermediate grades. 

Teachers were given as much time as they needed to complete the assessment. 
Times for completion ranged from less than an hour to two-and-a-half hours. The median 
was between an hour-and-a-half and an hour and forty-five minutes. 

Logistics 

Logistical arrangements included identifying a sample of teachers, administering the 
assessment (both before and after staff development), arranging for staff development, and 
acquiring evaluation feedback from participating teachers. 






TABLE 5.1 



PARTICIPATION IN STAFF DEVELOPMENT ACTIVITIES BY 
PARTICIPATION IN PRE- OR POST-TESTS 

ASSESSMENT OF COMPETENCE IN MONITORING STUDENT 
ACHIEVEMENT IN THE CLASSROOM 



                                     Number of Participating Teachers

                                     Pre-test                   Post-test
                              Form A  Form B  Total      Form A  Form B  Total

Teachers receiving staff
development                     17      16      33         15      16      31

Teachers not receiving staff
development                      9       8      17          8       7      15

Total number of teachers
completing evaluation forms     26      24      50         23      23      46






TABLE 5.2 



PILOT TEST PARTICIPANTS 

ASSESSMENT OF COMPETENCE IN MONITORING STUDENT 
ACHIEVEMENT IN THE CLASSROOM 



                   Participation in Staff Development Activities

                   Staff Development     Teachers Not Receiving
                                         Staff Development
                        N=33                  N=17

Gender
  Male                   10                     4
  Female                 23                    13

Ethnicity
  Asian                   1                     1
  Black                   2                     0
  Hispanic                1                     0
  Native American         0                     1
  White                  25                     8
  No Response             4                     7

Grade Level
  K-2                    10                     4
  3-6                    19                     5
  No Response             4                     8






Identifying teacher samples. The inclusion of the staff development component in 
the pilot test design made it imperative that teachers be located within a relatively compact 
geographic area. Districts were contacted about providing groups of elementary teachers to 
participate. Two nearly neighboring districts agreed to encourage their teachers to 
participate in the staff development and to solicit a group of comparison teachers who would 
only participate in the assessment. (The comparison teachers were to receive the training at 
a later date.) Teachers were paid by their districts for their participation. 

Orientation materials. Teachers were given no formal orientation materials; 
information concerning the assessment and staff development activities was distributed by 
district administrators, and was chiefly limited to the topic, directions to the assessment and 
staff development sites, and dates. 

Conducting the assessment. The assessment is designed for large-scale 
administration by a small number of test administrators, who distribute and collect 
materials and monitor the teachers. No special training or background in assessment is 
needed, as instructions are designed to be self-explanatory. 

The pretests began with a ten to fifteen minute overview of the research design, 
covering the following topics: (1) the purpose of the pilot test and descriptions of the spring 
pilot test activities; (2) identification of the assessment developer and distinctions between 
the roles of the assessment developer and FWL; and (3) the confidentiality and use of the 
results. Unfortunately, teachers traveling from their individual schools arrived at different 
times, and several missed the overview at the pretest. At the posttest, teachers were 
allowed to begin as soon as they desired after arrival. The overview was omitted for the 
posttest. 

Conducting staff development. The six three-hour staff development sessions were 
given over a two-month time period. They were scheduled for pairs of consecutive days 
with approximately one month between each group of sessions. The sessions occurred late 
in the afternoon on school days and were scheduled to allow the teachers enough time to 
travel to the staff development site after students were dismissed from school. A series of 
sites approximately half-way between the two districts was located. 

Obtaining feedback from the teachers. Evaluation feedback from the teachers was 
collected through a survey immediately after the posttest. 






Security 



Basic security precautions such as guarding copies of the assessment instruments 
and monitoring teachers for collaboration during the test were taken. It is possible that 
teachers taking different forms of the test discussed the exercises they completed during the 
period between the pre- and posttests. (The two forms were clearly distinguishable, as each 
was printed on a different colored paper.) Teachers completed the form which was not used 
for the pretest as a posttest. Since the teachers could have discussed their forms and 
questions with others, this is a potential source of biased results. However, there are 
several reasons to believe that the effect was minimal. First, two months elapsed between 
the pre- and posttests. Second, teachers were unaware that the same two forms would be 
used for the posttest. Finally, the assessment did not have any consequences for the 
teachers, so it is unlikely that they were motivated to make extensive efforts to learn what 
was on the other form. 

Exercises vary in the degree to which they would be susceptible to coaching or 
memorization of standardized answers so that a teacher could pass the assessment without 
understanding the underlying principles. Some exercises are more performance-based, e.g., 
construct three multiple-choice questions based on a given passage, and this type of exercise 
should be relatively robust to coaching effects since it depends on content that can be easily 
varied. The responses to some other exercises consist of stating principles for constructing 
various types of assessments which are relatively content free, and are vulnerable to the 
memorization of lists with little understanding of the principles or their application. 

Assessors and Their Training 

Two members of the FWL staff administered the assessment. No training was 
provided or deemed necessary. If statewide administration of an assessment of this type 
were contemplated, standardized guidelines for dealing with potential complications (e.g., a 
teacher becoming ill during the test) would be needed. 

Teacher Impressions of Administration 

Over three-quarters (79%, or 33 of 42) of the teachers completing an evaluation form 
believed that the arrangements for the assessment were reasonable. Those who disagreed 
cited the time of day (immediately after school) and/or distance from their school. Several 
teachers commented that it was difficult for them to complete a day of teaching, take a 
two-hour assessment, and then fill out an evaluation form. The following is an example of this 
type of response: 

The pretest and posttest are not accurate reflections of my 
ability. I was just too tired from teaching all day. I would have 
rather taken them in the morning and had an A.M. sub. 

Other teachers complained that the test was too long: 

Please note that the pre- and posttests themselves were extremely 
lengthy, tedious and time-consuming. A shorter version/format would 
have been appreciated. I frankly grew bored and tired from writing 
and am not sure my answers reflect the depth of my knowledge in 
many areas! 

The testing/pretesting and evaluation forms are unreasonable to 
expect teachers to fully answer. Too long, too much thought 
must go into each answer. You have to shorten your test or 
your participants will become extremely frustrated and 
disgruntled. I felt overwhelmed, but still answered fully. 
However, I don't expect others to follow suit. 

Comments about the length of the test were not as prevalent in the pilot tests that took as 
long or longer to complete, had similar feedback strategies, but were scheduled for 
Saturdays. 

Scoring 

The section on scoring describes the process used to score the instrument, the 
qualifications of scorers, the training of scorers, and their perceptions of that training. 

Scoring Process 

The scoring system consists of guidelines for each exercise which determine how 
many points should be awarded for a response. Total points possible vary from 3 to 8 
points for each of the ten exercises. Many of the exercises contributing a large number of 
points to the total score are composed of subparts, which are independently scored. The 
scoring guide also lists responses to be awarded intermediate points (e.g., one point instead 
of two for a specific subpart). These responses are deemed to be partially correct but 
incomplete. 

Each scoring criterion identifies responses for each level of credit earned. For 
instance, in explaining a national stanine score of 4 in language expression, a teacher is 
awarded the maximum score of two points if the score is interpreted to mean that Helen 
outscored 20 to 40 percent of the norm group, 1 point if the response says that Helen scored 
slightly below the average stanine of 5, and 0 points for any other response. While this 
example has a single response defined as correct for each level of credit, for some exercises, 
multiple correct responses are identified for each level of credit. The job of the scorer is to 
match the teacher response with the appropriate level of credit. 
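For readers unfamiliar with stanines, the following Python sketch shows where an interpretation such as "outscored 20 to 40 percent of the norm group" comes from. It assumes the conventional stanine distribution (4, 7, 12, 17, 20, 17, 12, 7, and 4 percent of the norm group in stanines 1 through 9); whether the test in the exercise used exactly these norms is an assumption, and the function name is hypothetical.

```python
# Conventional stanine distribution: percent of the norm group in stanines 1..9.
STANINE_PERCENTS = [4, 7, 12, 17, 20, 17, 12, 7, 4]

def stanine_percentile_band(stanine):
    """Return (lower, upper) cumulative percent of the norm group for a stanine."""
    if not 1 <= stanine <= 9:
        raise ValueError("stanine must be between 1 and 9")
    lower = sum(STANINE_PERCENTS[: stanine - 1])
    upper = lower + STANINE_PERCENTS[stanine - 1]
    return lower, upper

print(stanine_percentile_band(4))  # (23, 40)
print(stanine_percentile_band(5))  # (40, 60)
```

Under this convention a stanine of 4 covers roughly the 23rd to 40th percentiles, so the scoring guide's "20 to 40 percent" reads as a rounded version of the same band, and stanine 5 is the middle band, which is why a score of 4 is described as slightly below average.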

Some difficulty in scoring was experienced in that many of the responses described in 
the scoring guide were written in terms of technical language relating to assessment, using 
terms like reliability and validity. The beginning teachers responding to the exercises did 
not tend to discuss any answer in technical terms, so it was often difficult to judge 
responses, as they looked little like the criterion responses. For instance, one criterion 
response was "sample performance with a broad array of structured exercises or 
observations of naturally occurring events". A teacher response that was scored as correct 
read, "He could videotape the students on various occasions." 

One scorer described the problem as follows: 

I found the language in which the criteria were written to be quite 
off-putting. The test-designers might consider how to translate technical 
jargon into common English. It might also help them to see when 
exercises may not assess what they think they may be assessing. 

Scorers and Their Training 

Scorers. Six people completed the scoring training and scored the assessment. All 
six have experience in assessment development as well as some teaching experience. Two 
are FWL staff members; the remaining four are doctoral students in the field of education. 
Three scorers are former math teachers; one is a former science teacher; one is a former 
English teacher; and one has teaching experience at both elementary and secondary levels, 
principally in language arts. All four non-FWL scorers completed evaluation forms. 






The non-FWL scorers differed in their opinion as to the level of knowledge about 
assessment needed to score the assessment. One scorer believed that only a few items 
required more technical knowledge of assessment than the average scorer not trained in 
assessment might possess. Another scorer believed that a good working knowledge of 
assessment and assessment terminology was needed. A third scorer believed that scorers 
needed to be highly knowledgeable about assessment and current issues in measurement. 
The fourth scorer had no opinion. 

Three of the four non-FWL scorers believed that teaching experience was needed to 
score the assessment. One scorer stated, "The scorer's knowledge of teaching is more 
important than her knowledge of assessment. The scorer should be one who is accustomed 
to 'standing outside of teaching' and reflecting upon it." The fourth scorer did not believe 
that much knowledge of teaching was needed to score the assessment, although experience 
in teaching English or reading would be useful for scoring two of the exercises. 

Training of scorers. The scoring criteria were designed to require minimal training 
to score the instrument. The developer of the assessment instrument conducted the 
training of the scorers, which took about three hours and covered both forms. An 
experienced teacher completed both versions of the assessment to provide a set of sample 
responses distinct from those of the beginning teachers to be scored. The training consisted 
of the trainer reading the prompt, the sample response, and the scoring criteria. He then 
explained how the scoring criteria should be applied to that response, and asked for any 
questions from the scorers, which often sparked discussions of how to apply the scoring 
criteria. This process was repeated sequentially for each exercise. 

There was no provision for independent practice in scoring and monitoring of 
performance before the scorers began evaluating actual teacher responses. Some of this 
occurred informally, in the form of individual conversations and informal group discussions 
about how to apply various scoring criteria. 

Only one example of a teacher response for each exercise was used to demonstrate 
the application of the scoring criteria. Furthermore, the example was from an experienced 
teacher who had attended a graduate-level course taught by the assessment developer. 
Consequently, the sample response was much lengthier and more frequently used the 
technical language of the scoring criteria than the responses of the beginning teachers, 
which tended to be brief and couched in general terms. The training would have been 
strengthened considerably by demonstrating a greater number of scored responses and by 
providing examples which more closely resemble those to be scored. 




Perceptions of training. All four scorers who were not FWL staff completed 
evaluation forms, which included questions on the training. All four of the scorers rated the 
training as "adequate," the intermediate rating provided. However, one scorer added the 
qualifier "barely," and another noted that "I did not feel sufficiently confident that I was 
applying the scoring criteria like other scorers." The scorers agreed that the most useful 
part of the training was the discussion of specific examples of applications of the scoring 
criteria. Two scorers recommended adding more of this to the training. The other two recommended 
more extensive preparation for scoring, either by taking the assessment prior to scoring it to 
learn its contents or by reviewing materials sent in advance of the training which explain 
"what the test is designed to do, who it tests, under what conditions was the testing done, 
etc." 

Assessment Content 

This assessment differs from the others pilot tested for the California New Teacher 
Project in that it does not focus on a single subject matter, but on a teaching competency 
which cuts across subjects: monitoring the achievement of students in the classroom. 

The importance of the topic of classroom assessment is supported by an informal 
survey of the teachers attending the initial staff development session. When asked if they 
had received any training in assessment during their teacher preparation, only a few 
responded affirmatively. A growing number of regional and national surveys indicates that 
it is typical for teachers to lack training in assessment. Yet all teachers are expected to 
employ both formal and informal assessment techniques and to make judgements about 
students. 

The assessment developer has spent the last decade in the designing, testing, and 
redesigning of instruction in assessment for classroom teachers, and is a nationally- 
recognized authority in this area. The topics selected for the staff development component 
of the assessment are a subset of the instructional modules he has developed. This 
assessment is intended to test knowledge and application of principles for sound 
construction and proper use of student assessments that are based on both research and 
teacher feedback. 






In the following pages, the content of the Assessment of Competence in Monitoring 
Student Achievement in the Classroom is evaluated along these dimensions: 

• Congruence with various curriculum frameworks addressing curriculum in 
the elementary grades; 

• Extent of coverage of California Standards for Beginning Teachers; 

• Job-relatedness of the instrument; 

• Appropriateness for beginning teachers; 

• Appropriateness across different teaching contexts (e.g., grade levels, 
subject areas); 

• Fairness across groups of teachers (e.g., ethnic groups, gender); and 

• Appropriateness as a method of assessment. 

As was true of all of the assessment instruments pilot tested this spring and summer, 
there was not sufficient time during development to conduct a larger content validity study. 
Without such a study, our ability to comment on the assessment's appropriateness along 
such dimensions as job-relatedness, appropriateness for beginning teachers, and 
appropriateness across contexts is limited. Thus, excluding the first two dimensions of 
curriculum congruence and standards coverage (which are based on FWL staff's analysis of 
the documents involved), the discussions of the remaining dimensions are based on the 
perspective of the participating teachers and scorers, and of FWL staff, as reflected in 
feedback forms, in informal conversations with the scorers, and in analysis of the scores. 

Congruence with California Model Curriculum Guides and Frameworks 

The following discussion of the content begins with a comparison of the assessment 
instruments with the model curriculum guides. This assessment emphasizes knowledge of 
principles of valid assessment which pertain to every curricular area. Assessment is not 
typically addressed at length in the curriculum guides and frameworks, which mainly focus 
on curriculum content. However, there are aspects of nearly every curriculum guide and 
framework which address evaluation of student progress. The particular instruments pilot 
tested included references to evaluation of student progress in four subject matters: 
language arts, science, social science, and mathematics. 

Language arts. Three exercises portray assessment in the subject of elementary 
language arts. One exercise on each form addresses writing assessment, and an additional 
exercise on one form addresses the assessment of reading. One of the writing assessment 
exercises asks a teacher to give a student written feedback on a writing sample for several 
features chosen by the candidate (excluding mechanics). The focus of this exercise is on the 
teacher's ability to devise appropriate criteria, apply them, and explain their evaluation to a 
student in a way that provides useful feedback. The writing sample used for this exercise is 
an account of a student's friendship over time. This is consistent with the emphasis in the 
English-Language Arts Model Curriculum Guide: Kindergarten through Grade Eight 
(California State Department of Education, 1988) on basing instruction on students' 
experiences. 

The second writing assessment exercise, on one form only, asks the teacher to list 
some of the features of a writing assessment designed to provide diagnostic information and 
show change over time. However, some of the acceptable responses for this exercise are 
consistent neither with current research on writing nor with the emphasis on basing 
instruction on students' experiences in the English-Language Arts Model Curriculum Guide. 
In this exercise, two scoring criteria listing features of a writing assessment ("sampling with 
sound writing prompts" and "keeping the prompts constant over time") both suggest that the 
topic for writing comes from the teacher. One of the key recommendations in writing 
instruction is the importance of writer-generated topics (Graves, 1983). Another scoring 
criterion for the same exercise, which requires concealment of the writer's identity to avoid 
bias, also runs counter to writing research which stresses the influence that background 
knowledge plays in a reader's or writer's construction of meaning. Familiarity with this 
background knowledge is deemed necessary for effective evaluation of student work. The 
preselected topic and concealed identity of the writer might make more sense for large scale 
program assessment, but it is less appropriate for classroom assessment. 

The third exercise, on one form only, portrays three approaches to evaluating 
reading ability. Teachers are asked to identify positive and negative features of all three 
approaches. One approach measures good reading by the ability to answer paper-and-pencil 
questions about what was read, i.e., the ability to construct meaning from a text. Another 
approach tests for the ability to read fluently, i.e., sound out words, and the third examines 
the ability of readers to retell the meaning of what was read in their own words with 
fidelity to the true meaning, i.e., reproduce the message in the text. 



To the extent that the first approach is linked with the completion of 
narrowly-focused worksheets, it conflicts with the emphasis in the relevant Model Curriculum Guide 
cited previously. The Model Curriculum Guide emphasizes that students should be 
encouraged to actively interpret texts and that such interpretations vary from student to 
student, as each student constructs meaning based on their own set of background 
experiences. The second approach would not be considered by most reading experts to be a 
good single measure of reading ability, and is not mentioned in the English-Language Arts 
Model Curriculum Guide. Research on reading shows that while the ability to sound out 
words is often associated with the ability to construct meaning from the text, sometimes 
readers can sound out words without understanding the meaning. The third approach is 
consistent with the emphasis in the English-Language Arts Model Curriculum Guide on an 
interrelated program of listening, speaking, reading, and writing. To the extent that it 
implies that meaning inheres in the text independently of the reader, however, it conflicts 
with the English-Language Arts Model Curriculum Guide. 

Science. Four exercises across the two forms use science as the focal subject of 
assessment. These four exercises constitute two sets of parallel exercises across the two forms. 
One set asks teachers to list principles for evaluation of the potential of either multiple- 
choice unit tests provided with textbooks or laboratory activities as assessment instruments. 
The other set asks teachers to construct multiple-choice items testing both recall and higher 
order thinking skills based on a given passage from a science textbook. 

The latest science framework available is the Science Framework for California 
Public Schools, Kindergarten Through Grade Twelve (California State Department of 
Education, 1990), which was released after this assessment was developed. As the most up- 
to-date statement of desirable content and framework of the California science curriculum, 
however, it is the standard to which the assessment exercises using science content are to be 
compared. The Science Framework supports the goal of increasing time devoted to hands- 
on activities in science classes to at least 40 percent of the total time devoted to teaching 
science. The exploration of laboratory activities as tools of assessment could be useful in 
measuring student achievement during hands-on activities. 

Another emphasis in The Science Framework is the teaching of science in depth 
rather than superficially. The evaluation of multiple-choice tests to assess content 
knowledge gained through study of a unit is a valid assessment approach to consider. 
However, the exercise would be more congruent with The Science Framework if it were 
amended slightly to make clear that the textbook was accompanied by a series of laboratory 
exercises, and the unit tests measure learning from both sources. 




The design of multiple-choice tests to assess higher-order thinking skills was one of 
the staff development activities. However, the excerpts from science textbooks on which the 
students were to be tested were too brief and superficial to support higher-order thinking 
without making many assumptions about the background knowledge of students. 
Furthermore, the testing of knowledge based on a brief passage conflicts with the emphasis 
in The Science Framework on in-depth knowledge gained at least partially through 
observation or experimentation which goes well beyond information presented in a textbook. 
An alternative which might be more congruent with The Science Framework would be to 
provide brief descriptions of a small series of laboratory activities together with the 
scientific principles to be inferred or reinforced, and to ask the teachers to construct 
multiple-choice questions to test mastery of these principles. 

Social science. There were two exercises which focussed on social science content, 
one on each form. One exercise asked teachers to compare two different assessment 
approaches: an oral response comparing two countries on a given dimension and an item on 
a written test comparing the two countries. As with other curriculum frameworks, The 
History-Social Science Framework (California State Department of Education, 1988) 
emphasizes in-depth understanding of topics as opposed to more superficial knowledge. It 
also emphasizes understanding the significance of characteristics of governments or 
countries. The task portrayed in this exercise closely resembles memorization of isolated 
facts, and could be revised to portray an activity more congruent with the current 
framework, such as comparing the factors contributing to the evolution of Mesopotamia, 
Egypt and China as societies (California State Department of Education, 1988: 61). 

The other exercise with social studies content asked teachers to suggest alternative 
ways of assessing limited-English-proficient students who might not understand the 
multiple-choice, true/false, and fill-in-the-blank items in the unit tests provided with the 
textbook. The History-Social Science Framework calls for more than the assessment of 
student progress in learning knowledge. Additional goals recommended are: (1) the 
assessment of basic skills and abilities, including those of thinking and social participation; 
(2) the utilization of a variety of evaluative techniques, including the teacher's evaluation of 
the students' performance, students' evaluation of personal progress, and peer evaluation; 
and (3) opportunities for students to make oral and written reports in which they are 
encouraged to state a position and support it. While carefully constructed paper-and-pencil 
tests such as those referred to in the exercise can measure some higher order thinking skills 
with respect to content knowledge, other forms of assessment would probably be needed, 
forms that are likely to be more appropriate for both limited-English-proficient students 
and other students as well, given the additional recommended evaluation goals in the area 
of social science. For instance, teachers might be asked to design a performance-based 
assessment to meet one of the goals described above for a heterogeneous classroom that 
included several limited-English-proficient students. 

Mathematics. Only one exercise on one form focused directly on assessment in 
mathematics, and that exercise portrayed a classroom interchange between the teacher and 
several students in which the content plays a relatively minor role. The Mathematics 
Framework (1985) emphasizes problem solving and an increase in the use of cooperative 
learning groups. Some aspects of assessment relating to these emphases were addressed by 
the exercises, though using other subjects, such as conducting observations and performance 
assessments and constructing multiple-choice questions to measure higher-order thinking 
skills. However, problem solving and the use of cooperative learning groups, though not 
peculiar to mathematics, each pose problems for assessment of student achievement which 
are not explored by this assessment. How would one measure problem-solving ability, 
especially when the type of problem solving advocated in The Mathematics Framework 
includes encouraging students to follow incorrect strategies to learn for themselves how to 
determine when a strategy is not working? How do you individually evaluate students when 
they are engaged in a group activity? Neither of these assessment dilemmas was addressed 
by the assessment. 

Examination of specific subject-matter content suggests that modifications of the 
exercises are necessary to bring the assessment into closer congruence with current 
curriculum guides and frameworks with respect to the content represented in the prompt 
materials. With respect to the variety of assessment approaches represented in the 
exercises, the assessment addresses knowledge of how to construct effective performance 
assessments, observation protocols, and paper-and-pencil assessments. As can be seen from 
this list, most if not all of the assessment methods which might be used to assess the more 
in-depth knowledge called for in the latest curriculum guides and frameworks are 
represented in the current collection of exercises. Sometimes the exercises stop short of 
measuring a teacher's performance in constructing and using these assessment approaches, 
testing instead knowledge of general principles of assessment design which the candidate 
may or may not be able to apply correctly. Eliciting more performance-based responses 
from the teachers or asking teachers what conclusions they would draw from different 
assessments that vary in soundness of design would require more direct demonstration of 
assessment skills. 






Extent of Coverage of California Standards for Beginning Teachers 

The California Beginning Teacher Standards are criteria for teacher competence and 
performance which the Commission on Teacher Credentialing expects graduates of 
California teacher preparation programs to meet. The usual practice in evaluating 
assessments pilot tested is to consider the stimulus materials and scoring criteria in light of 
each standard. However, this assessment was narrowly focussed on a single area of teacher 
competence: measuring student achievement in the classroom. Therefore, it will be 
discussed with respect to the only applicable standard, Standard 27, the text of which is 
printed below in italics. 

Standard 27: Student Diagnosis, Achievement and Evaluation. Each candidate 
identifies students' prior attainments, achieves significant instructional objectives, and 
evaluates the achievements of the students in a class. 

Although evaluating a teacher's ability to use assessment techniques for student 
diagnosis and evaluation is the goal of the assessment instruments, the exercises are uneven 
in their ability to accomplish this goal. Some of the exercises (e.g., one which asks teachers 
to write multiple-choice items to test comprehension of a paragraph of text provided) 
require teachers to apply their knowledge of general principles related to assessment to 
specific situations. Other exercises, though utilizing classroom-related problems, only ask 
the teacher to respond in terms of general principles for constructing a valid assessment, 
and do not determine whether or not the teacher can apply these principles. 

One example of a missed opportunity to test application of assessment-related 
knowledge is the exercise where teachers interpret standardized test scores. With one 
exception, the exercise focuses on evaluating the technical accuracy of the explanations of 
various scores. The exception is where the teachers are asked if they should use a 
grade-equivalent score to choose the level of work for a child. Teachers could have been asked 
how they would respond to questions from parents about the meaning of the scores, and 
judged on whether they could communicate the appropriate meaning. 

The major focus of the exercises is on assessment issues, sometimes to the exclusion 
of other issues in a complex situation with many competing goals. Teachers rarely use such 
a singular focus to analyze problems. The exercises vary in the extent to which they 
accommodate competing concerns which might lead a teacher to use less effective 
assessment techniques to achieve other goals. For example, one set of exercises asks 
teachers to label assessment practices either "sound" or "unsound" and explain their position. 
The criteria for scoring one subpart allow a practice which is otherwise sound from a strict 
measurement point of view to be labeled unsound due to the effect on the self esteem of 
students. In contrast, other exercises, e.g., an exercise on how a teacher handles cheating, 
require teachers to focus solely on measurement effects and disregard any other effects, e.g., 
penalties for cheating that may affect measurement of achievement but serve as a powerful 
negative reinforcement discouraging future cheating. (Several scorers also commented that 
many schools and districts have an official policy on penalties for cheating which teachers 
are required to follow, and that it would be difficult for beginning teachers to contradict 
local policy.) If a teacher is directed to evaluate a decision which has potentially negative 
effects on goals unrelated to assessment, then the instructions should instruct the teacher to 
consider the effects on assessment alone as an initial step toward identifying the 
consequences of that particular decision. 

Some of the scoring criteria are debatable even in measurement terms, e.g., one 
exercise which asks teachers to judge whether dropping a student's lowest grade is a sound 
or unsound method of assessing student achievement. One line of thought, reflected in the 
current scoring criteria for the highest value, would argue that dropping a grade reduces the 
scope of the assessments or amount of information over which the student is evaluated, thus 
reducing the validity and reliability of the assessment. Another line of thought holds that 
reliability is increased when outliers are eliminated, so the highest and lowest scores could 
be dropped. And, again, measurement of noninstructional objectives is not addressed in this 
current version. For example, the motivational objective of deleting or substituting for a 
low score is not acceptable within the current scoring guide. 
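
The two lines of thought about dropping a lowest grade can be made concrete with a small worked example. The sketch below is illustrative only: the five scores are hypothetical and are not taken from the assessment or its scoring guide. It shows that dropping the lowest score leaves an average based on fewer observations (less information about the student) while making that average less sensitive to a single outlying score.

```python
from statistics import mean

def average_with_drop(scores):
    """Return (mean of all scores, mean after dropping the single lowest score)."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to drop one")
    kept = sorted(scores)[1:]  # discard only the lowest score
    return mean(scores), mean(kept)

# Hypothetical percent-correct scores on five unit tests for one student.
scores = [82, 85, 40, 88, 85]
full, dropped = average_with_drop(scores)
print(f"all five tests: {full:.1f}; lowest dropped: {dropped:.1f}")
```

Whether the second average is the better summary of achievement depends on whether the low score reflects a real gap in learning or an unrepresentative occasion, which is the point on which the two positions differ.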

Because of the problems described above with the exercises developed, FWL staff 
judge this assessment to only partially cover Standard 27 of the Beginning Teacher 
Standards. 


Job-Relatedness 

Both teachers and scorers were asked whether the scenarios chosen were relevant to 
a teacher's job of monitoring student achievement. 

Teacher perceptions. A total of 42 teachers completed an evaluation form responding 
to various questions about the assessment. While all teachers completing the form had 
taken both forms of the assessment as either a pre- or posttest, half of the teachers had 
taken Form A as the posttest, and half Form B. It is likely that the form used as the 
posttest was the major influence on their response, as the pretest had occurred two months 
previously. Therefore, the results are reported separately by form. 

When the teachers were asked if they felt "the scenarios chosen for this assessment 
are relevant to your job of monitoring student achievement," 71% (15 of 21) of the teachers 
completing form A as the posttest and 67% (14 of 21) completing form B responded 
affirmatively. Those who agreed did not elaborate on why they believed the assessment to 
be relevant. Those who disagreed gave several reasons, including type of students taught 
and grade level. The following comments are illustrative: 

Most of my students (85% are Chapter 1 students) they need hands on 
teacher directed assessments. Rarely do I use a standardized test or 
essay, more of a discussion and an assignment. 

As a kindergarten teacher it was a bit of a reach to answer the upper 
grade questions. There were none at my level of teaching experiences. 

Scorer perceptions. All four scorers believed that the assessment was relevant for 
teachers. "Most of the items present situations commonly confronted by beginning 
teachers," commented one scorer. However, even while agreeing that the assessment was, 
on the whole, relevant, one scorer expressed reservations about a few exercises, and another 
was not certain that the assessment was relevant for kindergarten teachers. 

Generally, both teachers and scorers believed that the scenarios represented in the 
assessment were relevant to the job of elementary teaching. 

Appropriateness for Beginning Teachers 

The appropriateness for beginning teachers was assessed through both surveys of 
teachers and scorers and through analysis of the teacher scores. 

Teacher perceptions. Teachers were asked if, as new teachers, they felt that they 
had "sufficient opportunity to acquire the knowledge and abilities needed to respond in a 
reasonable manner to the assessment questions." Sixty-seven percent (14 of 21) of the 
teachers completing form A as a posttest and 48% (10 of 21) of the teachers completing form 
B believed that they had such opportunities. Roughly half of those who disagreed that the 
test was appropriate for beginning teachers received the staff development training, and 
half did not. Those who did not attend the training generally cited their lack of instruction 
in assessment, as in the following comments: 



I have received no training in assessment. Bloom's taxonomy of types of 
questions has been the only information I have received concerning 
assessment. 

Not one of my education courses covered the rationale behind test 
procedures (I have a Masters in Instructional Leadership). 

Those who did participate in the staff development workshops gave several reasons 
for their belief that their preparation had been inadequate, including a need for more 
classroom experience, and a desire for more practice in application of the principles learned. 
The range of comments is reflected in the following set: 

It takes a very long time to practice being reasonably confident in 
assessment. 

Several questions require experience as well as a theoretical application. 
Also even though certain practices were covered in course work such as 
testing evaluation, they were not covered in depth with practical 
application. 

I need time and practice. I liked the workshops but I need more 
hands-on. I learn by doing. 

The issue of appropriateness for beginning teachers was also explored through 
another question which asked the teachers if they found any questions on the posttest too 
difficult. Sixty-seven percent (14 of 21) of the teachers taking form A and 43% (9 of 21) of the 
teachers taking form B reported that there were questions they found too difficult. Table 
5.3 shows the topic of the exercises the teachers reported finding difficult. The largest 
number of teachers reported difficulty with interpretation of standardized test scores, 
particularly with explaining a stanine score. The next largest numbers had difficulty with 
writing assessment and constructing multiple-choice items. 






TABLE 5.3 

TOPICS OF EXERCISES REPORTED BEING TOO DIFFICULT 

ASSESSMENT OF COMPETENCE IN MONITORING 
STUDENT ACHIEVEMENT IN THE CLASSROOM 

                                                     Number of Teachers 
Topic                                                Reporting Difficult 

Interpretation of standardized test scores                   10 
Writing assessment                                            5 
Constructing multiple-choice items                            4 
Calculating composite test scores                             3 
Assessing reading                                             1 
Assessment during instruction                                 1 
Constructing an observation assessment                        1 
Effect of penalty for cheating on soundness 
of assessment                                                 1 






Scorer perceptions. Stating a consensus that most teachers do not presently receive 
sufficient instruction in monitoring student achievement, none of the scorers believed that 
typical beginning teachers have the knowledge necessary to perform well on the assessment. 
However, as one scorer put it, "These are issues that new teachers could be taught to 
address. They are not necessarily dependent on the amount of experience that one has with 
students." The scorers see this as a topic that is not beyond the grasp of beginning teachers 
if they were to receive more instruction in classroom assessment. 

Performance on assessment tasks. Table 5.4 summarizes the performance of 
teachers participating or not participating in staff development. (Since each teacher was 
scored by two scorers, both scores were summed to arrive at a final score.) Teachers taking 
Form A as a pretest took form B as a posttest, and vice versa. The highest score possible 
on each form was 106. As the table indicates, the majority of scores are well under 106, 
suggesting that the teachers were not sufficiently prepared in classroom assessment to do 
well on this instrument. 

Teachers as a group did particularly well on the exercise which asked them to 
identify alternative ways to assess LEP students and on the exercise analyzing a teacher's 
verbal responses to students. They did particularly poorly on constructing multiple-choice 
questions, suggesting ways to accurately assess students whose cultural norms interfered 
with common ways of assessing student performance, interpreting standardized test scores, 
and evaluating specific assessment practices, including a proposed penalty for cheating. 

These results do not provide evidence that the staff development had a strong impact 
on teacher performances on the assessments. The assessment developer has conducted 
training of similar content in settings where experienced teachers participated in the 
training in either university courses or district inservice programs. Possible explanations 
for the absence of clear training effects in the pilot test include: (1) given that these were 
new teachers, inexperienced in not only student assessment but also in conducting their own 
classrooms, the format of one-session-to-a-topic staff development workshops conducted 
after school was probably insufficient to completely address such a complex topic in a 
manner that the beginning teachers could absorb, (2) training where teachers self select an 
area, which was not the case here, will result in greater impacts than when there is not an 
explicit commitment by the teacher participants, and (3) the time period, after school 
setting, and related fact that teachers did not necessarily practice or apply the concepts 
contained in training sessions would reduce the effectiveness of the training. (In university 
and district inservice sessions the teachers are given and complete assignments that build 
on the training provided. The pilot test did not have this feature.) 
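
The comparison behind this conclusion can be sketched from the group means reported in Table 5.4. The short Python example below is a descriptive illustration only, not the analysis performed for the report: it expresses each mean as a percentage of the 106-point maximum and computes the posttest-minus-pretest difference for each group, pairing each pretest form with the opposite posttest form as described above. The function and variable names are hypothetical. Because the group sizes differ slightly between pretest and posttest and the two forms may not be equally difficult, the differences should be read cautiously.

```python
# Means transcribed from Table 5.4; the maximum possible score per form is 106.
MAX_SCORE = 106

# (group, form, occasion) -> mean of the summed scores from two scorers
means = {
    ("participating", "A", "pretest"): 39.82,
    ("participating", "A", "posttest"): 44.93,
    ("participating", "B", "pretest"): 45.06,
    ("participating", "B", "posttest"): 47.13,
    ("not participating", "A", "pretest"): 43.56,
    ("not participating", "A", "posttest"): 41.86,
    ("not participating", "B", "pretest"): 43.50,
    ("not participating", "B", "posttest"): 49.50,
}

def percent_of_max(mean_score):
    """Express a mean score as a percentage of the maximum possible score."""
    return round(100 * mean_score / MAX_SCORE, 1)

def change(group, pre_form):
    """Posttest mean minus pretest mean for teachers who took pre_form first.

    Teachers who took one form as the pretest took the other as the posttest.
    """
    post_form = "B" if pre_form == "A" else "A"
    diff = means[(group, post_form, "posttest")] - means[(group, pre_form, "pretest")]
    return round(diff, 2)

for group in ("participating", "not participating"):
    for pre_form in ("A", "B"):
        print(f"{group}, pretest on Form {pre_form}: change = {change(group, pre_form)}")
```

In this tabulation both groups show a gain when Form B is the posttest and a small decline when Form A is the posttest, which is consistent with the statement that the results give no evidence of a strong staff development effect.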






TABLE 5.4 

TEACHER PERFORMANCE BY FORM, PRE- OR POST-TEST, 
AND PARTICIPATION IN STAFF DEVELOPMENT 

ASSESSMENT OF COMPETENCE IN MONITORING STUDENT 
ACHIEVEMENT IN THE CLASSROOM 

                                      Form A                  Form B 
                                Pretest    Posttest     Pretest    Posttest 

Teachers participating in 
staff development 
  Mean                           39.82      44.93        45.06      47.13 
  Standard Deviation             11.95      16.32        11.97      12.23 
  N                              17         15           16         16 

Teachers not participating 
in staff development 
  Mean                           43.56      41.86        43.50      49.50 
  Standard Deviation             17.35      14.31         8.75      10.01 
  N                               9          7            8          8 






Appropriateness across Contexts 



The contexts explored on the evaluation form included both differing grade levels 
and differing types of students. 

Grade level and subject matter. Table 5.5 shows the grade levels represented in the 
set of exercises for form A and form B. Both forms, but particularly form B, were more 
weighted toward questions focusing on the highest grades covered by the elementary 
credential. 

Teachers were asked directly if they felt that "the assessment was appropriate for 
different teachers teaching grade levels from kindergarten to eighth grade." Forty-three 
percent of the teachers (9 of the 21 teachers taking each form as a posttest) believed that it 
was. Teachers who did not believe it was appropriate made the following comments: 

I did not feel comfortable answering exercise 9 [design of an 
observation assessment to determine whether or not a student should 
be retained in kindergarten] since I'm not familiar with kindergarten. 

This assessment does not relate to my daily events and situations - I 
was asked to be a junior high teacher, a fifth grade teacher, seventh 
grade teacher, a sixth grade teacher, but only once asked to be a 
kindergarten teacher (which I am). Only half of the questions were 
circumstantially general enough for me to be comfortable answering. 

The questions were geared toward the middle school grades. 

This last set of comments echoes similar comments on other assessments, and 
illustrates a paradox that teacher assessments for licensure must address. On the one hand, 
teachers are licensed to teach multiple grade levels, so situations portraying various grade 
levels should be represented in the stimulus materials to which teachers are asked to 
respond. On the other hand, the rationale for delaying administration of performance-based 
assessments until the early years of teaching is to focus on competencies which may take 
some unspecified amount of independent teaching experience to develop. The experience of 
beginning teachers, however, typically only covers a limited number of grades. How to 
design assessments so that a teacher can draw upon their teaching experience yet sample 
the entire range of grade levels covered by a credential is a problem that remains to be 
solved, particularly by assessments of elementary teachers where not only multiple grade 
levels but also multiple subjects are included. 

TABLE 5.5 

DISTRIBUTION OF SCENARIOS IN EXERCISES ACROSS GRADE LEVELS 

ASSESSMENT OF COMPETENCE IN MONITORING 
STUDENT ACHIEVEMENT IN THE CLASSROOM 

                         Number of Exercises 
Grade Level              Form A      Form B 

K-2                        2           0 
3-5                        1           4 
6-8                        5           4 
Unspecified                2           2 

Scorers were not asked to comment on fairness across grade levels. 

Diverse students. Another dimension of the appropriateness of the assessment 
across contexts is fairness across differing groups of students taught. California students 
are increasingly diverse. Only four of the 42 teachers responding to the evaluation survey 
reported that none of their students spoke any language other than English. Six teachers 
reported that their students collectively spoke five or more languages. (It should be noted 
that this does not necessarily mean that students who spoke a language other than English 
did not speak English fluently, only that California classrooms are increasingly diverse in 
terms of student cultures.) 

The assessment was consciously designed to represent a variety of students, and to 
place members of gender and racial/ethnic groups in non-traditional roles whenever possible. 
Examples of the latter are the portrayal of a man as a kindergarten teacher and a black 
female as the highest scorer on a series of tests. The exercises were examined by the 
development team to verify that both positive behaviors and negative behaviors portrayed in 
the exercises were distributed across students of various backgrounds. While most exercises 
did not discuss the classroom composition, the names used to identify specific students were 
characteristic of a variety of ethnic groups. One exercise on both forms addressed the 
assessment of students whose cultural norms might interfere with traditional assessment 
measures. Another exercise on one form dealt with the assessment of limited-English- 
proficient students. A third exercise addressed the design of an observation assessment to 
identify students who should be retained in kindergarten. 

Teachers were asked if they believed the assessment was "appropriate for teachers of 
diverse student groups (e.g., different student ability levels, different ethnic groups, 
handicapped or limited English students, different school/community settings)." They 
overwhelmingly believed that it was, with 90% (or 19 of the 21 teachers completing each 
form) replying that it was appropriate. The one teacher disagreeing who elaborated on the 
choice cited the exercises on paper/pencil examinations, explaining that Limited-English- 
proficient students did not take such examinations. 






Scorers were asked to comment on the appropriateness of the assessment to address 
"the ability of the new teacher to work with diverse student groups." Their responses were 
mixed. One scorer believed that it was appropriate, while the other three expressed varying 
degrees of reservation. One scorer praised the exercise which asks teachers to suggest 
alternatives to paper-and-pencil tests to assess limited-English-proficient students, but 
believed that another exercise focusing directly on how to avoid cultural norms interfering 
with assessment practices needed to be "overhauled." Another scorer recommended either a 
greater emphasis on cultural diversity or heavily weighting the same exercise addressing 
cultural norms which the previous scorer criticized. The remaining scorer was "not sure 
that either form adequately addressed the issue of diversity" believing that the exercise on 
cultural norms was "presented so badly that candidates didn't have much of a chance to 
display their knowledge about how to deal with a multicultural classroom." In informal 
discussions, the scorers pointed out that many of the teachers discussed the examples in 
terms of psychological rather than cultural explanations (e.g., the student who did not wish 
to draw attention to him/herself was perceived as being shy rather than as coining from a 
culture which stressed communal rather than individual achievement), and several stated 
their belief that the examples provided misled the teachers. 

Scorers were also asked their opinion of the suitability of the assessment for new 
teachers in different school and community settings. Two scorers believed that it is "general 
enough" to be used for teachers in different settings, although one qualified the answer by 
suggesting additional scoring criteria for one exercise. The additional criteria accepted a 
currently ineligible response if qualified due to an inadequate supply of materials or an 
extremely large class size. The other two scorers expressed reservations due to differing 
philosophies of assessment or policies regarding assessment practices which they did not see 
equally reflected in the scoring criteria. 

To further explore the freedom from bias against teachers from particular teaching 
contexts, the assessment developer's description of the bias review, Form B, and the 
relevant scoring guide were sent to Dr. Sharon Nelson-Barber, an assistant professor of 
anthropology at Stanford University, for review. Dr. Nelson-Barber is a consultant who 
works with school districts and teachers of homogeneous minority classrooms. 

Dr. Nelson-Barber praised the portrayal of equity and diversity in the exercises, but 
she expressed concern about the exercises' ability to do the following: 






• accommodate alternative conceptions of teaching 



• account for the assumptions about context and preferred teaching practices that 
guide candidate responses and scorer ratings 

• accommodate answers that might be deemed correct by the scorer but do not 
appear in the scoring guide 

The first two concerns apply not only to this assessment but to all the 
assessments which Dr. Nelson-Barber reviewed. Her concern about the ability of the 
assessment to accommodate different conceptions of teaching centered around the fear that 
the instructional techniques and interactive behaviors deemed effective in minority 
communities would not be seen as appropriate. Dr. Nelson-Barber points out that there are 
many ways to be a good teacher, and she believes that teachers should be evaluated with 
regard to the ways that they are trying to use. 

Both the teachers responding to the assessment exercises and scorers have 
philosophies of teaching and preferred teaching practices which undoubtedly influence their 
judgements. Teachers in this assessment are not permitted an opportunity to communicate 
assumptions about the teaching context in which the incidents portrayed in the exercises 
occur and how the context affects the reasoning that underlies a particular response. Dr. 
Nelson-Barber has reservations about scoring a response as correct or incorrect without 
understanding the contextualized reasoning which led to the response. For example, the 
exercise designing an evaluation of a writing sample explicitly tells the teacher to exclude 
mechanics, yet many inner city parents and teachers view mechanics (e.g., grammar, 
spelling) as extremely important for their students to master in order to successfully 
compete for good jobs. 

FWL staff note that there is little contextual information about the classrooms or 
districts portrayed in the exercises, so it is likely that teachers draw from their own 
experience in answering the questions. Teachers of classrooms which require certain 
assessment strategies, e.g., a large number of ESL students who have difficulty reading 
English, may find exercises difficult which focus on assessment methods which are not 
suitable for their students. 

Since Dr. Nelson-Barber is not an expert in assessment, her ability to comment on 
the appropriateness of the range of acceptable responses reflected in the scoring guide was 
limited. Based on her general experience working in diverse classrooms, she believed that 
reasonable limits had been set for the scoring criteria; however, she expressed discomfort 
with the fixed nature of the acceptable responses, which does not allow scorer discretion to 
give credit for acceptable responses that do not appear on the scoring guide. 

Fairness across Groups of Teachers 

The teachers were asked if they believed that the assessment was fair "to new 
teachers of both genders, different ethnic groups, different language groups, and other 
groups of new teachers." Nearly all of the teachers believed that the assessment was fair, 
with 90% (19 of 21) of those completing form A as a posttest and 95% (20 of 21) of those 
completing form B agreeing that the assessment was fair. 

The scorers responded to a similar question on their evaluation form. Two scorers 
commented that the assessment seemed to favor teachers with strong written, as opposed to 
oral, communication skills. One of these scorers also commented that teachers' subject 
matter backgrounds could either give them an advantage or serve as a distraction for 
different exercises. A third scorer believed that teachers with limited proficiency in written 
English were handicapped, especially in writing multiple-choice items. (However, an 
implicit emphasis on English proficiency is probably appropriate, unless a teacher is a 
bilingual teacher primarily instructing in another language.) The fourth scorer qualified the 
answer as follows: "If the expectation is that all teachers should be able to take and pass 
pencil and paper tests and that all new teachers have been taught the 'lingo' of 
psychometrically valid assessment, then it is a fair test." (This scorer had expressed the 
belief that the exercises did not match what good teachers actually do in their classrooms to 
monitor student achievement, and that people with the appropriate vocabulary and mastery 
of abstract concepts could do well on the assessment without being able to actually apply 
any of the principles they could describe. If the scorer's belief is true, this could be 
addressed by the inclusion of more performance-based items.) 

Appropriateness as a Method of Assessment 

Teachers were asked to comment on the appropriateness of the assessment 
methodology in two ways: by agreeing or disagreeing that they believed the assessment was 
an appropriate way to assess competency in monitoring student achievement and also by 
comparing the assessment method with other assessments with which they had been 
evaluated, with the CBEST, the NTE, and observations during student teaching given as 
examples. 






Teacher perceptions. Thirty-eight percent (8 of 21) of the teachers completing form 
A as a posttest and 48% (10 of 21) of the teachers completing form B believed that the 
assessment was an appropriate way of assessing their competency in evaluating student 
achievement. Those agreeing that it was appropriate gave responses and suggestions such 
as the following: 

It makes you think through and explain your positions on assessment. 

But an "open-book" format would be more useful and accurate - showing 
how the competency is used, not memorized. 

Teachers who disagreed questioned the validity of the assessment: 

Even if I answered everything correctly, it doesn't necessarily mean I 
have bought into the values expressed and have initiated them in my 
class. (In my case, however, I am doing my best to improve). 

The real test is how we assess in our classrooms on real kids that we 
know and understand. Many of the exercises do not apply to a primary 
teacher, such as test scores, and multiple choice test items. 

I think a better way would be to look over my grade book, tests used, and 
have someone offer practical suggestions as to my method of assessment. 
(How much should a test weigh compared to daily work? Should curve or 
straight percent be used? How do you grade a writing sample of a RSP 
student in a regular reading class compared to others?) 

Scorer perceptions. Scorers were asked to describe the strengths and weaknesses of 
the assessment. Three of the scorers praised the topic of the assessment. Two scorers saw 
the ease of administration as a strength. One scorer cited the relative ease of application of 
scoring criteria. One scorer summarized the strengths of the assessment as follows: 

I find the general thrust of this assessment instrument valuable. It asks 
teachers to justify their approaches to assessment in specific contexts 
commonly confronted in the classroom, and to perform basic tasks related 
to assessment that all teachers should be able to do. I appreciate the 
inclusion of questions pertaining to culturally-based beliefs, and the 
assessment of writing. 




In terms of weaknesses in the assessment instruments, three of the four scorers cited 
poor instructions and/or stimulus materials, such as instructions which did not always 
indicate the scope of the response reflected in the scoring criteria, badly written passages 
from which to construct multiple-choice questions, and ambiguous or poorly worded 
scenarios. Two scorers criticized the range of responses deemed appropriate, with one 
commenting that it seemed as if drawing on prior classroom experience negatively impacted 
the teachers' scores. Two of the scorers also questioned the sample of assessment activities 
and whether it represents what teachers should know or use in actual classroom practice. 
Finally, one scorer each mentioned as a weakness the emphasis on principles of assessment 
instead of their application, arguable scoring criteria, and the particular representation of 
"good" practice in the exercises focusing on literacy. 

Comparison with other assessments. About 38% (8 of 21) of the teachers completing 
form A as a posttest and 57% (12 of 21) of the teachers completing form B compared this 
assessment format favorably with other assessment formats with which they had been 
evaluated. They appreciated the opportunity to explain their answers instead of having to 
choose among fixed options in multiple-choice questions, as exemplified by two teachers' 
comments: 

It better tests one's overall ability to communicate ideas and beliefs than 
the CBEST or NTE. Multiple choice tests are very limiting and there is 
always the guess factor to consider. That possibility is ruled out here. 

It's really difficult for me to say if I liked the questions better. I would 
say that this is a much better assessment than multiple choice tests 
because I can put my thoughts down for you to see rather than just 
marking a box answer that someone else put down for me to choose from. 

Some teachers preferred a combination of assessment formats, such as the teacher 
who suggested "a true blend of CBEST, observation, [and] essay." Two teachers mentioned 
a preference for classroom observations. 

The scorers were asked to identify the unique contribution of the instrument to the 
assessment of elementary teachers, compared with other assessment methods. One scorer 
expressed qualified approval of the method employed: 

This method is clearly preferable to one-shot classroom observations. It 
does not capture how teachers conduct assessments or use them in their 
classrooms, as portfolios could. It does attempt to establish general, 
standardized questions all teachers should be able to answer in the area 
of assessment, which is a tremendous advance over little or no concern 
given to general issues in assessment faced by teachers. 

One scorer had no opinion, and one expressed discomfort with a paper-and-pencil 
test for measuring how well teachers assess student progress. The fourth scorer summed up 
their evaluation with the following statement: 

I suppose that the most personal and concise way to convey my 
evaluation of this instrument is to say that I would not want my 
competency of assessment practices and skills assessed with it in its 
present form and given the accompanying scoring criteria, particularly 
without receiving the accompanying training. 

Assessment Format 

Format Features 

The format employed by the Assessment of Competence in Monitoring Student 
Achievement in the Classroom is that of a paper-and-pencil test with brief written scenarios 
serving as stimuli to which teachers respond in writing. The assessment also consisted of a 
design which involved testing before and after completion of a series of staff development 
workshops. 

Clarity of Assessment 

Because this assessment was in the developmental stage, the focus of the evaluation 
questions with respect to the assessment was on identifying problems in the assessment 
which could affect teacher responses. Teachers were only asked to elaborate on their 
negative responses; some teachers also elaborated on positive responses. 

Clarity of questions. Teachers were asked if they believed that the directions for 
each exercise were clear. Eighty-one percent (17 of 21) of the teachers completing form A as 
a posttest and 57% (12 of 21) of those completing form B believed that the directions were 
clear. Those teachers who found some questions to be unclear wished for a definition of 
some specific terms, didn't understand the point of some exercises, or encountered 
directions which they believed contradicted their training, e.g., write a multiple-choice 
question assessing higher-order thinking skills when at one point the training stressed 
that multiple-choice questions were not the best approach (although the training also 
included practice in writing such multiple-choice questions). 

The scorers were asked if there were any exercises with which the teachers 
consistently had difficulty. Their responses identified several exercises where they believed 
that improved directions would have oriented a number of teachers whose answers went 
astray. These exercises included: 

• the exercises on both forms asking teachers to evaluate the soundness of various 
grading practices. Sometimes, the scoring criteria required two reasons to be 
given in order to receive full credit; however, this was not indicated in the 
directions, and many teachers gave a single reason. 

• the exercises on both forms addressing cultural norms which interfere with 
assessment. This exercise was misunderstood by many of the teachers. Instead of 
addressing how different assessments could avoid conflicting with cultural norms, 
teachers talked about either how to counter stereotypes in the classroom or 
addressed psychological or interpersonal issues rather than cultural ones. In 
addition, many teacher responses indicated cultural insensitivity in that they 
believed that the solution to the problem was varying degrees of encouragement to 
produce the desired behaviors (e.g., if a student's culture discouraged participation 
in classroom discussion, some teachers suggested calling on the student more 
often). 

• the exercise on one form focusing on defining features of a valid writing 
assessment. Instead of identifying features of a writing assessment (e.g., a 
pre/post design), teachers identified features of the writing to be evaluated (e.g., 
grammar or presence of a theme). 

• the exercise on one form which asked teachers to compare three methods of 
assessing reading. Teachers commented more on the goals of reading instruction 
exemplified by these methods than on the assessment methods themselves. 

• the exercises on both forms requiring the construction of multiple-choice items. 
Several teachers constructed fill-in-the-blank or matching questions instead of 
multiple-choice questions. However, the term "multiple-choice" appeared in 
capital letters in the instructions. The consensus among the scorers was that the 
directions specifying the type of answer to be provided should have been separated 
from the introductory information into a concluding paragraph, providing a 
focused set of instructions for the teachers. 

In addition, the passages about which teachers were to write multiple-choice 
questions were considered by several scorers to be problematic. Neither passage 
appeared to be written to deliberately stimulate higher order thinking, so 
construction of questions measuring higher-order thinking skills was challenging 
and required going well beyond the information presented in the text. One scorer 
summarized the problem: "I found the passage itself so poorly written and 
difficult to comprehend that I wasn't sure how anyone could write a reasonable 
test question that went beyond recall but that students could answer." 

• In one exercise on one form, the term "draw the comparison" was used in the 
context of comparing two methods of assessing student understanding of 
differences between countries. Some teachers took this literally as meaning a 
physical drawing, while others assumed that the term meant a verbal comparison. 
(The latter was meant by the assessment developer.) 

Teachers also were easily distracted by exercises which contained more than one 
issue, for example, a verbal exchange between a teacher and student concerning a cheating 
incident which asked the candidate to assess whether or not the teacher took an appropriate 
course of action. When the presenting dilemma was complex, teacher responses sometimes 
reflected goals other than assessment, e.g., penalizing inappropriate behavior in the cheating 
incident described. This may result from teachers gravitating toward issues with which 
they feel more comfortable; given insufficient instruction and practice in assessment, the 
measurement of student achievement is unlikely to be one of them. However, the presence of 
competing goals is precisely the context in which many assessment decisions will be made. 
One possible remedy for this is to recognize the competing goals in the presentation of the 
scenario, acknowledge that assessment goals have to be balanced against the competing 
goals, and ask the teacher to discuss the prompting problem solely in terms of 
accomplishing accurate assessment. For some of the exercises such as the cheating incident, 
this method of revision runs the danger of trivializing the problem, reducing the question to 
be answered to "Does this solution negatively affect the measurement of achievement?" 

Clarity of scoring criteria. Scorers were asked if they "had any difficulties in 
applying the scoring criteria for any of the assessment exercises." Every scorer indicated 
that they had difficulty with some exercises, with two of the scorers describing difficulties 
with eight of the twenty exercises. General difficulties included: 

• determining whether different statements are redundant or contain separate 
points 

• judging when a multiple-choice item is testing recall vs. higher-order thinking 
skills, especially when teachers fail to follow instructions that ask them to label 
the latter items and the item is poorly written 

• differentiating between general platitudes or instructional strategies not 
specifically related to the problem (which were not to be given credit) and 
responses which are specific enough to warrant credit 

• judging ambiguous responses which combine discussion of methods of assessment 
with other issues, e.g., goals of reading 

• determining whether a response is an example of a writing trait which is more 
than simply a variation of mechanics 

Many of these problems could be solved through the provision of more examples 
during training that are more typical of those being scored, and by providing more precise 
definitions (either conceptual or through a series of contrasted examples) of the borderline 
between acceptable and unacceptable responses. Revision of the instructions for some 
exercises to provide candidates as well as scorers with an orientation toward the type of 
response expected would also facilitate scoring. 

Evaluation of staff development training. Teachers who participated in the staff 
development training were asked to evaluate the sessions for two purposes: improving their 
performance on the posttest and for improving their ability to monitor student achievement 
in their own classroom. Twenty-nine teachers responded to the survey. 

In terms of improving their performance on the posttest, 10% (3 teachers) 
evaluated it as "very useful"; 76% (22 teachers) rated it as "somewhat useful"; and 14% (4 
teachers) rated it as "of little use." Several teachers praised the handouts and information 
presented. Many teachers reported being overwhelmed by the amount of information, citing 
a difficulty in recalling content presented in the early sessions, a need for more depth and 
less breadth in the material presented, and a lack of time to complete the reading of the 
handouts during the school year. 

As for improving their ability to monitor student achievement in their own 
classrooms, 24% (7 teachers) of the teachers evaluated it as "very useful"; 59% (17 teachers) 
of the teachers rated it as "somewhat useful"; and 14% (4) of the teachers rated it as "of 
little use." One teacher rated it as somewhere between "very" and "somewhat" useful. 
Several of the teachers felt that it made them better critics of tests, both published tests and 
tests they developed themselves. Some teachers wished for more examples at the primary 
(K-3) level. Other teachers felt that it was too early in their teaching career to assimilate 
the information, and hoped that they could find ways to apply it through summer study of 
the handouts. 

Over half (59% or 17) of the teachers reported using some of the principles presented 
in the staff development sessions in their monitoring of classroom achievement. 

Cost Analysis 

We will outline cost estimates for administering and scoring the Assessment of 
Competence in Monitoring Student Achievement in the Classroom based on the current 
status of the draft assessment that was pilot tested. We will also report the costs for 
developing this prototype and for pilot testing it. These costs should be taken as only 
preliminary estimates for costs that would be incurred if an assessment like this were to be 
further developed and modified for implementation on a wide scale. 

Administration and Scoring Costs 

This assessment is administered in a large group setting. The assessment can be 
administered by one or more persons with little or no training in the specific content of the 
assessment using procedures common to standardized group test administrations. 

Scoring the assessment requires training raters knowledgeable in the content of the 
assessment. Scoring of the pilot test data, which included both training and actual scoring, 
required two days for six scorers. The majority of this time was devoted to scoring the 
assessments with only approximately one-half day devoted directly to training. 

The six raters were able to complete nearly 400 ratings in the one and one-half days 
devoted primarily to rating. Dividing the 400 ratings by nine scorer days (i.e., six assessors 
times 1.5 days) results in an estimate that a scorer can rate approximately 40-50 
assessments each day or approximately 4-6 per hour. Using $160/day as the cost for a 
scorer results in an estimate of approximately $3-4 per assessment rated. 

Training costs can also be estimated using the pilot test scoring experience. The 
half-day training in the pilot test cost approximately $80/scorer plus the costs for the 
trainer. In the pilot test the trainer costs were distributed across only six scorers. It would 
be feasible to expand the number of scorers that could be trained in one session. Increasing 
the number trained to 10 would result in an estimate of $960 as the cost for a one-half day 
training for the 10 scorers, including an allocation of $160 for the trainer. If it is assumed 
that after training a scorer would spend two and one-half days rating and in this period 
could rate 40 assessments/day, the training costs would be distributed across 100 assessments 
per scorer or 1,000 assessments for 10 scorers. Thus, a half-day of training would cost 
approximately $1 per assessment. However, the scorers found a half-day training to be 
inadequate; recommended changes would result in training lasting at least one day, doubling 
the estimated cost to $2 per assessment. Combining the training and rating costs would 
result in an estimate of $5-6 per assessment for scoring this prototype assessment. 

Costs for test administration, duplication of materials, postage, travel, etc. would also 
need to be added to the costs for scoring the assessments. We have used $30 per 
assessment for other large scale, group administered assessments as an estimate for these 
costs. Combining these results in the following estimated cost for administering and scoring 
this prototype assessment: 

Training and Scoring: $5-6 per assessment 

Administration/Other Costs: $30 per assessment 

Total Administration and Scoring Costs $35-36 per assessment 
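The arithmetic behind these estimates can be retraced in a few lines. The following is only an illustrative sketch: the dollar figures and counts come from the text above, while the function name and the choice to model a full day of training as exactly twice the half-day cost are assumptions made for the example.

```python
def estimate_costs(daily_rate=160, ratings=400, scorers=6, rating_days=1.5,
                   trained_scorers=10, trainer_cost=160,
                   days_rating_after_training=2.5, ratings_per_day_planned=40,
                   other_costs=30):
    """Retrace the per-assessment cost estimate (all inputs from the text)."""
    # Observed productivity in the pilot: ratings per scorer day.
    ratings_per_day = ratings / (scorers * rating_days)
    rating_cost = daily_rate / ratings_per_day

    # Half-day training for a group of scorers, plus the trainer.
    half_day_training = trained_scorers * (daily_rate / 2) + trainer_cost
    assessments = (trained_scorers * days_rating_after_training
                   * ratings_per_day_planned)
    training_half_day = half_day_training / assessments
    # Assumption: a full day of training simply doubles the half-day cost.
    training_full_day = 2 * training_half_day

    return {
        "ratings_per_day": ratings_per_day,
        "rating_cost": rating_cost,
        "half_day_training_total": half_day_training,
        "training_cost_per_assessment": training_full_day,
        "total_per_assessment": rating_cost + training_full_day + other_costs,
    }


if __name__ == "__main__":
    for name, value in estimate_costs().items():
        print(f"{name}: {value:.2f}")
```

Under these inputs the sketch lands inside the ranges quoted in the text; changing the number of scorers trained per session or the planned ratings per day shows how sensitive the training share of the cost is to those assumptions.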

Development and Pilot Testing Costs 

Costs for developing this prototype assessment were $83,971 and are broken out by 
general cost categories in Table 5.6, which also includes costs for pilot testing. These 
development costs are the expenses for the assessment developer to deliver the prototype 
assessment forms to the CTC and SDE. Additionally, $15,687 was expended in conducting 
the pilot test. These costs included those for FWL staff to observe teacher training provided 
by the assessment developer in which new teachers were trained in the concepts covered by 
the assessment. 

TABLE 5.6 

DEVELOPMENTAL AND PILOT TEST COSTS FOR THE 
ASSESSMENT OF COMPETENCE IN MONITORING STUDENT 
ACHIEVEMENT IN THE CLASSROOM 

Cost Category                                Development    Pilot Test 

Staff Salaries & Benefits                        $42,632       $ 8,801 
Consultants (Teachers, assessors, 
  and other consultants)                          10,000         1,840 
Travel (Consultants and staff)                     9,176         1,278 
Other Direct Costs (Site rental, 
  phone, duplication)                              6,394           465 

Total Direct Costs                               $68,103       $12,384 
Indirect Costs                                    15,868         3,303 
Total Costs                                      $83,971       $15,687 

These provide samples of developmental costs that should be considered if a similar 
assessment were to be adapted for use. 

Technical Quality 

Development 

This assessment was developed as a successful response to a request for proposals to 
develop innovative forms of assessment for possible use in the licensure of new teachers in 
California. The assessment content was chosen based on a decade of research and 
development work by NWREL staff to identify task demands of classroom assessment and 
to design staff development activities which enable teachers to meet those demands. To 
develop the two forms of the instrument, NWREL staff constructed over thirty original 
exercises. Following review by members of the California Interagency Task Force, and then 
editorial revision, the exercises were assembled into pilot test packages and administered to 
a small number of teachers. Teachers participating in this "shakedown" were interviewed 
and asked to complete a questionnaire. Both the teachers' performances on the exercises 
and their opinions of the exercises were used to eliminate exercises which clearly did not 
work well. Exercises retained for further analysis were revised and edited to improve their 
clarity. After another review by the Interagency Task Force staff, the exercises were 
reviewed for potential bias, and were revised to eliminate any bias found. 

The final phase of assessment development was the creation of scoring criteria for 
each exercise. Exercises for which clear and defensible criteria could not be identified were 
eliminated. The preliminary scoring criteria were tested against a small subset of the pilot 
test responses, which resulted in major revisions. 

The resulting set of prototype exercises was then divided into two sets such that the 
content coverage was parallel. In many cases, this meant creating parallel forms of the 
same exercise. In other cases, different exercises covering the same material were chosen. 
An attempt was made to represent a variety of levels and subjects specified in the contextual 
information for each exercise. Another round of revisions based on further review by 
Interagency Task Force staff resulted in the two forms pilot tested in this assessment. 






Reliability 



The following analyses were performed on the pilot test data of the 50 teachers who 
took the pre-test and the 46 teachers completing the post-test. Interrater agreements were 
examined to assess the degree to which the scorers were able to consistently judge 
candidates using the scoring protocols provided. Internal consistency estimates were 
generated to assess the degree to which the variables or factors within each of the activities 
would form a measure and the degree to which the different activities related to each other 
and might form an overall assessment of a candidate. 

Interrater agreements. The first measure of agreements among scorers was the 
differences in total scores between scorers rating the same candidate responses. These 
differences are presented in Figure 5.1. The differences were sorted into three groups: 
paired scores differing by 0-2 points, those differing by 3-5 points, and those differing by 
six or more points. 

The degree of agreement among raters was such that only 15-18 percent of the 
rating pairs differed by 6 or more points, which represents approximately 10 percent of the 
total possible. Ideally, it would be desirable to have a larger percentage of the rating pairs 
be within 0-2 points. But, given the draft nature of this prototype, these results suggest that 
the scoring criteria and system are such that raters can make similar judgments. The ability 
of the raters to achieve consensus on the ratings with these scoring criteria should also be 
interpreted in light of earlier comments by the raters about the scoring system. That is, 
although raters could understand and follow the rating criteria, they have also identified 
areas in which revisions should be considered. 
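The grouping of paired scores into difference bands can be sketched as follows. This is a minimal illustration, assuming whole-point total scores; the function name and the sample score pairs are hypothetical and are not pilot test data.

```python
from collections import Counter

BANDS = ("0-2", "3-5", "6 & up")


def bucket_differences(score_pairs):
    """Return the percent of rating pairs falling in each difference band."""
    if not score_pairs:
        raise ValueError("at least one pair of scores is required")
    counts = Counter()
    for first, second in score_pairs:
        diff = abs(first - second)
        if diff <= 2:
            counts["0-2"] += 1
        elif diff <= 5:
            counts["3-5"] += 1
        else:
            counts["6 & up"] += 1
    total = len(score_pairs)
    return {band: 100 * counts[band] / total for band in BANDS}


if __name__ == "__main__":
    # Hypothetical total scores given by two scorers to the same six responses.
    pairs = [(40, 41), (35, 39), (28, 36), (50, 50), (44, 47), (31, 30)]
    for band, pct in bucket_differences(pairs).items():
        print(f"{band}: {pct:.1f}%")
```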

Interrater correlations. Correlations between scorers also serve as an estimate of 
interrater agreement. The correlations among rater pairs are displayed in Table 5.7. 
Correlations were calculated for total scores separately by form and rater pair. The average 
correlations across rater pairs were also calculated for each form. 

These data also support a conclusion that agreement among raters can be achieved 
using the current criteria and system. It is likely that further refinements and revisions 
could result in even closer agreement between raters. 
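The report does not state how the average correlation across rater pairs was formed. As a hedged illustration, the sketch below compares two common conventions using the pair correlations reported in Table 5.7: a simple arithmetic mean and a mean taken after Fisher's z transformation. The function names are illustrative, and neither method is claimed to be the one the authors used.

```python
import math


def mean_r(correlations):
    """Arithmetic mean of the correlation coefficients."""
    return sum(correlations) / len(correlations)


def fisher_mean_r(correlations):
    """Average in Fisher z space, then transform back to a correlation."""
    zs = [math.atanh(r) for r in correlations]
    return math.tanh(sum(zs) / len(zs))


if __name__ == "__main__":
    # Pair correlations as reported in Table 5.7.
    forms = {
        "A": [0.86, 0.88, 0.84],
        "B": [0.88, 0.80, 0.89, 0.67, 0.87],
    }
    for form, rs in forms.items():
        print(f"Form {form}: simple mean = {mean_r(rs):.3f}, "
              f"Fisher z mean = {fisher_mean_r(rs):.3f}")
```

The two conventions give nearly the same answer when the pair correlations are close together (Form A) and diverge somewhat when one pair is much lower than the rest (Form B).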

Internal consistency of the assessment forms. Coefficient Alpha reliability estimates 
were calculated for the two forms by using the individual ratings for exercises or their 



5.40 



211 



FIGURE 5.1 

Rating Differences Between Scorers 



at 



Percent of Ratings 



100% 



80% - 



60% - 



40% - 



20% 



0% 




0-2 3-5 6 & up 



0-2 3-5 6 & up 



Form A 



Form B 



o 

ERIC 



212 



213 



TABLE 5.7 

CORRELATIONS BETWEEN PAIRED RATINGS FOR THE ASSESSMENT OF 
COMPETENCE IN MONITORING STUDENT ACHIEVEMENT 
IN THE CLASSROOM 

                      Rater Pair*                       Averaged 
Form       1       2       3       4       5         Pair Ratings 

A         .86     .88     .84                            .86 
         (15)    (16)    (16) 

B         .88     .80     .89     .67     .87            .84 
          (6)     (5)     (5)    (16)    (16) 

* Rater Pairs for Form A and B are different. The numbers in parentheses are 
the number of teachers scored by each rater pair. 



subparts. Calculations were also done separately by whether or not the form was a pretest 
or a posttest. These Alpha reliabilities are listed below: 



Form              Reliability       N 

Form A               .67            48 
   Pretest           .70            26 
   Posttest          .68            22 

Form B               .53            48 
   Pretest           .56            24 
   Posttest          .52            24 

In contrast to the agreement among raters, the degree of internal consistency 
evidenced within the prototype assessment forms is modest to low. This suggests that in its 
current form the assessment does not form an overall measure of a single factor of teachers' 
knowledge of monitoring classroom practice, but includes items that measure somewhat 
independent factors. Further development would be needed to determine whether the 
assessment in this area would result in multiple factors and measures or whether the 
assessment might form a more homogeneous measure of teachers' knowledge in this area. 
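
For readers unfamiliar with Coefficient Alpha, the following sketch shows the standard formula, alpha = (k / (k - 1)) * (1 - sum of exercise variances / variance of total scores), where k is the number of exercises or subparts. The function name and example ratings are hypothetical; the pilot test ratings themselves are not reproduced here.

```python
from statistics import variance

def coefficient_alpha(ratings):
    """Coefficient Alpha for a list of per-teacher rating lists.

    Each inner list holds one teacher's ratings, one per exercise or subpart.
    Sample variances (n - 1 denominator) are used throughout.
    """
    k = len(ratings[0])
    # Variance of each exercise across teachers.
    item_variances = [variance([row[i] for row in ratings]) for i in range(k)]
    # Variance of each teacher's total score.
    total_variance = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical ratings: three teachers, two exercises each.
consistent = [[1, 1], [2, 2], [3, 3]]   # exercises rank teachers identically
mixed = [[1, 2], [2, 1], [3, 3]]        # exercises partly disagree
alpha_consistent = coefficient_alpha(consistent)
alpha_mixed = coefficient_alpha(mixed)
```

When the exercises rank teachers identically, alpha reaches 1.0; as the exercises diverge, alpha falls, which is the pattern the modest-to-low values above reflect.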

Validity of Agreement through Group Comparisons 

Differences in performances were examined for male/female, primary/intermediate 
grade, inner city/non-inner city, and white/minority teachers. This section uses the pilot 
test data to look for indications of any possible group differences in performance on the 
assessment. The pilot test sample size and design were not constructed to provide 
information sufficient to provide stable estimates comparing differences among these groups. 



For instance, there were only two minority teachers in some of the groups. Nevertheless, 
an examination of differences among groups provides some initial insights into the validity 
of this assessment. Table 5.8 contains a summary of trends for the pilot sample of 42 
teachers who completed evaluation surveys, including demographic information. Appendix 
C provides the means, standard deviations and numbers of candidates from which these 
summaries were constructed. A plus (+) indicates that the mean or average for the first 
group was greater than that for the second group. For example, the pluses for the first 
column indicate that for two of the four tests, the average female score was greater than 
that of the males. 

Trends in the table indicate that for this pilot test sample: 

• the average score of females was higher than that of males half the time; 

• the average score of primary (K-3) teachers was higher than that of intermediate 
(4-6) teachers on three of the four tests; 

• the average score of inner city teachers was lower than that of non-inner city 
teachers on all four tests; 

• the average score of white teachers was higher than that of minority teachers on 
three of the four tests. 

If these trends were to hold for larger, more representative samples, the only 
encouraging trend would be that for gender. No differences are large, and the average male 
score is higher than the female no more often than the reverse. The small sample sizes 
preclude the drawing of any conclusions about how the different groups of teachers might 
perform on this type of assessment. However, these trends suggest that any further work 
on assessments of this type should include the examination of possible differences between 
primary and intermediate teachers, inner city and non-inner city teachers, and white and 
minority teachers. 

Content validity. Evidence of the content validity of this assessment comes primarily 
from two sources. The first is the decade-long experience of the assessment developer in 
developing a curriculum for teaching assessment of student competence in the classroom, 
based not only on research but also on feedback from practicing teachers. The second is 






TABLE 5.8 

TRENDS OF MEAN DIFFERENCES IN PERFORMANCE BETWEEN 
CANDIDATES WITH DIFFERENT CHARACTERISTICS* 

ASSESSMENT OF COMPETENCE IN MONITORING STUDENT ACHIEVEMENT 
IN THE CLASSROOM 

                                Level of           Teaching 
                   Gender       Teaching           Location           Ethnicity 
                   Female/      Primary/           Inner-City/        White/ 
Form               Male         Intermediate       Non-Inner City     Minority 

A   Pretest          +              +                                     + 
    Posttest                                                              + 

B   Pretest          +              +                                     + 
    Posttest                        + 

SUMMARY             2/4            3/4                 0/4               3/4 

* Entries reflect the direction of the mean differences for the different candidates. 
For example, on the pretest for Form A, the average mean of female teachers 
in the pilot test was greater than the males. These do not generally represent 
statistically significant differences and due to small N's no tests of significance 
were calculated. 






the analyses of the congruence between the assessment and the various Model Curriculum 
Guides and the California Beginning Teacher Standards. 

Conclusions and Recommendations 

This section contains conclusions and recommendations regarding the Assessment of 
Competence in Monitoring Student Achievement in the Classroom, organized into the areas 
of administration, scoring, content, format, and a brief summary. 

Administration of Assessment 

Like other large-scale examinations, the Assessment of Competence in Monitoring 
Student Achievement in the Classroom is administered simultaneously to a large number of 
people. Because it draws on many years' experience in conducting such examinations, the 
administration of the actual assessment poses few logistical problems. The most crucial 
logistical activity is the selection of the assessment and staff development sites. Although 
the assessment portion of the Assessment of Competence in Monitoring Student 
Achievement in the Classroom is less expensive to administer than many of the other 
assessments piloted, the economy of scale achieved depends on the number of teachers 
participating at a single site. Therefore, the higher degree of centralization afforded by an 
assessment of this type may place larger burdens on teachers from rural areas who will have 
to travel some distance to a selected site. 

The exercises varied in their susceptibility to memorization of standardized answers, 
which would allow a teacher to pass the assessment without the knowledge and ability to 
apply the principles tested. The exercises which appeared to be least vulnerable to this 
type of coaching were more performance-based, and teacher responses were very dependent on 
the subject-matter content presented in the stimulus exercise. The exercises which were 
judged to be highly vulnerable were those involving listing of general principles which 
would be invariant across subject matter. We recommend that any future development of 
instruments of this type focus on the more performance-based exercises. 

Scoring 

The scoring system is in need of further developmental work. The system for 
training scorers could benefit from the following improvements: 






• providing opportunities for scorers to practice scoring independently, with the 
scoring compared against a standard and problems in scoring discussed as a 
group; 

• a greater number of examples of scored responses; 

• examples that more closely resemble those to be scored; and 

• more precise definitions (either conceptual or through a series of contrasted 
examples) of acceptable and unacceptable responses. 

These revisions to the current scoring system should reduce the number of problems 
reported with its implementation and provide scorers with more concrete guidance to 
evaluate teacher responses. 

Assessment Content 

Our observations and information collected from scorers and teachers participating 
in the pilot test suggest the following conclusions about content: 

• Modifications of the exercises are necessary to bring the assessment into closer 
congruence with current curriculum guides and frameworks. 

• Most, if not all, of the assessment methods which might be used to assess the 
more in-depth knowledge called for in the latest curriculum guides and 
frameworks are represented in the current collection of exercises. 

• The exercises vary in their ability to evaluate a teacher's competence with respect 
to student diagnosis, achievement, and evaluation, as set forth in the California 
Beginning Teacher Standards. The more promising exercises were performance- 
based. 

• Most of the teachers believed that the exercises had relevance to their task of 
monitoring student achievement. 

• Teachers were mixed in their opinions as to whether they had been sufficiently 
prepared for the assessment. This was true of the teachers who participated in 
the staff development activities as well. Those who did not attend the staff 
development workshops attributed their insufficient preparation to a lack of 
relevant coursework, while those completing the training cited a need for practical 
application and/or more experience. 

• As is true of many of the assessments pilot tested to date, the content of the 
Assessment of Competence in Monitoring Student Achievement in the Classroom 
focussed on the higher grade levels covered by the credential. Teachers were split 
on the appropriateness of the assessment for teachers at different grade levels, 
with half believing that it was fair across different grade levels, and half believing 
that it was not. 

• The assessment was designed to portray students from a variety of cultural 
backgrounds, as well as men and women in nontraditional roles. 

• Teachers overwhelmingly believed that the assessment was appropriate for 
teachers of diverse student groups. An expert on teaching diverse students 
expressed concern about several aspects of the assessment. The first is the ability 
of the assessment to accommodate alternative conceptions of teaching, particularly 
those which might be particular to specific types of classrooms. The assessment, 
for example, does not account for the contextual assumptions that are implicitly 
made by candidates and scorers. This is particularly important because little 
contextual information is given in each exercise. Finally, the scoring guide does not 
provide for the possibility of correct responses which were not previously 
identified. 

• Teachers overwhelmingly believed that the assessment was fair to different groups 
of teachers. Scorers had mixed views, with concern expressed about (1) the 
possible advantage of people with strong written communication skills over those 
whose strength lay in oral communication skills, and (2) the possibility of passing 
people who had mastery of the technical language and abstract principles but not 
mastery of their application. 

• Less than half of the teachers believed that the assessment was an appropriate 
way of assessing their competency in evaluating student achievement, preferring 
instead methods which examined their actual assessment of their own students. 






Assessment Format 



The format is that of a paper-and-pencil test with brief written scenarios serving as 
stimuli to which teachers respond in writing, accompanied by a series of staff development 
workshops. 

Based on evaluations by teachers, scorers, and FWL staff, the following modifications 
in the assessment instruments are needed: 

• Revision of the directions for the exercises so that they clearly indicate the scope 
of the expected answer. 

• Identification of those exercises where a singular focus on assessment is likely to 
have a negative impact on other goals (e.g., self-esteem, classroom management). 
These exercises should either be eliminated or revised. Revisions should include 
acknowledgement (1) of the competing goals; and (2) that considering the effects 
on assessment is only one step in evaluating the practice portrayed. 

• Elimination of the goal of designing "self-evident" scoring criteria, along with a 
corresponding expansion of the training of scorers. 

Most teachers rated the staff development training as "somewhat useful" in terms of 
improving their ability to monitor student achievement in their own classroom; however, 
the assessment scores of teachers who participated in the staff development workshops 
were not noticeably higher than those of teachers who did not participate. 

Summary 

The Assessment of Competence in Monitoring Student Achievement in the Classroom 
needs substantial revisions in prompt materials, scorer training, and scoring criteria before 
it could be considered for use in licensing in California. Such revisions should proceed 
toward the development of more performance-based exercises, where teachers are asked to 
actually develop a portion of an assessment. 




CHAPTER 6: 

LANGUAGE ARTS PEDAGOGICAL KNOWLEDGE ASSESSMENT 




The Language Arts Pedagogical Knowledge Assessment (LAPKA), developed by 
Northwest Regional Educational Laboratory, is a series of video-based exercises designed to 
assess the pedagogical content knowledge of elementary school teachers in language arts. 
The Spring 1990 pilot test version of LAPKA consists of four exercises, each of which is a 
videotape of a typical language arts classroom situation (i.e., scenario). The candidate's task 
is to view the videotape and respond in writing to a series of questions about the instruction 
depicted in the scenario. For some of the exercises, candidates also receive selected support 
materials (e.g., stories read by students) to assist them in their analysis. 

Although each scenario depicts a language arts activity, the scenarios differ in the following 
respects: type of teaching activity, grade level, and group size. Scenarios 1A and 1B, for 
example, are two versions of a teacher-led direct instruction lesson for a small group of first 
graders, while Scenario 2 depicts teachers conducting writing conferences with individual 
students at different grade levels (i.e., 2nd, 4th, and 5th). Table 6.1 provides a summary of 
some of the characteristics of the four LAPKA scenarios. As described in the table, all of 
the scenarios depict a diverse student population (e.g., different ethnic groups, mixed 
abilities). 

Each scenario is approximately 20 minutes in length; however, each exercise takes 
about an hour and 15 minutes to complete because of the time required to view the 
videotape (which is viewed in segments) and then write responses to the questions. 

A more complete description of the content of each scenario follows: 

Scenario 1A: Teacher-led, Small Group Reading Lesson. Scenario 1A is a teacher-led, 
direct instruction activity with a small group (eight students) of first graders. The general 
purpose of the lesson is preparing students to read. The focus of the activity is the story, 
Dragons and Giants, by Arnold Lobel. The videotape is divided into two sections. The first 
section features a pair of pre-reading activities: a vocabulary review and a word web. The 
second section shows the teacher and students orally reading the story, with particular 
attention given to the teacher's questioning strategies. 



TABLE 6.1 

SOME CHARACTERISTICS OF THE FOUR LAPKA SCENARIOS FOR ASSESSING 
A TEACHER'S PEDAGOGICAL CONTENT KNOWLEDGE IN LANGUAGE ARTS 

                      Scenarios 1A + 1B        Scenario 2           Scenario 3 

Teaching Task         Teacher-led, direct      Student writing      Literature 
                      instruction in           conferences          discussion 
                      reading 

Grade Level           1                        2, 4, 5              4/5 combination 

Group Size            Small group              Individual           Whole class 
                                                                    and cooperative 
                                                                    groups 

Student Population    Range of reading         Diverse              Heterogeneous 
                      abilities, ethnic        abilities and        SES students, 
                      backgrounds, and         ethnic               predominately 
                      socioeconomic            backgrounds          Hispanic, 
                      levels                                        gifted program 


Scenario 1B: Teacher-led, Small Group Reading Lesson. Scenario 1B is another 
version of a teacher-led, direct instruction reading activity with a small group (ten students) 
of first graders. The purpose again is to prepare students to read. The teacher conducts 
the lesson around the story Cookies, by Arnold Lobel. She begins by introducing the 
concept of will power, using actual cookies as teaching props. After a group discussion 
(including brainstorming) about descriptive words, the group reads the story. Next, the 
students make paper cookies and graph their results. The lesson ends as the group 
prepares for poetry writing. 

Scenario 2: Individual Writing Conferences. Scenario 2 consists of six writing 
conferences conducted by three different teachers. Each teacher works individually with a 
pair of students from either grades two, four, or five. The purpose for the writing 
conferences is to provide students with feedback on drafts of their writing. The conferences 
vary in length. A brief videotaped introduction precedes each pair of conferences. 

Scenario 3: Cooperative Group Literature Discussion. Scenario 3 depicts a 
combination fourth/fifth grade gifted class in a cooperative learning activity in which the 
students examine the central theme of a story. Centered on a chapter from the novel, Sign 
of the Beaver, by Elizabeth Speare, the scenario is composed of three segments. The 
videotape begins with the teacher explaining her plans and goals for the lesson to the 
viewer. The second section shows the students working in small groups, reading orally, and 
discussing pre-formulated questions. The last section shows the teacher leading a discussion 
with the whole class. 

For each scenario, the candidates are asked to respond in writing to three different 
types of questions: (1) description, (2) evaluation, and (3) extension. That is, the 
candidates are asked to describe important features of the pedagogical methods represented 
in the videotape, evaluate the effectiveness of these methods, and extend the principles 
underlying these methods to suggest ways of improving or changing the methods shown. 
Listed below is an example of each type of question: 

EXAMPLES OF LAPKA QUESTIONS 

DESCRIPTION: Describe at least three important ways the teacher prepared the 
children for reading this story. (Scenario 1A) 






EVALUATION: Briefly comment on the strengths and/or weaknesses of this 
teacher's answer to the question on grading. (Scenario 2) 



EXTENSION: In what ways might the lesson have been altered to increase its 
effectiveness? (Scenario 3) 

The number of questions varies from scenario to scenario. Including subparts, the 
candidate responds to four questions in Scenario 1A, for example, but fourteen questions in 
Scenario 2. The candidate is expected to write about one paragraph per question, but the 
responses can be as brief as a phrase or as long as several paragraphs. Candidates are given 
approximately five minutes to respond to each question. 

To score the performances, the candidates' written responses are compared against a 
list of possible acceptable responses or, in a few instances, general guidelines. For each 
response, the candidate receives either zero, one, or two points, with a maximum score 
possible for each question. The scores given at the question level are summed to give a total 
score for the scenario. While not specified in the developer's final report, it is likely that a 
candidate's scores for all the scenarios would then be totalled and reported as a single, 
cumulative score. 

Administration of Assessments 

The following section presents a discussion of the logistics of administering the 
assessment, security issues, requirements for test administrators, and teacher and FWL 
impressions of administration. 

Overview 

LAPKA was administered at three sites, one in the Bay Area and two in Southern 
California between April 21 and May 12, 1990. As seen in Table 6.2, 42 beginning (first- or 
second-year) teachers participated in the pilot test. Four of the teachers were males and ten 
were members of minority groups. There were 21 teachers at grades K-3 and 21 at grades 
4-7. Slightly more than three-quarters of the teachers (33) completed their teacher education 
preparation in California higher educational institutions, 8 received their preparation 
outside the state of California, and 1 did not respond. Twenty-five teachers had taken two or more 
methods courses in reading; seventeen had taken none or one. All of the teachers except 
one indicated that some of their students spoke languages other than English. 






TABLE 6.2 
PILOT TEST PARTICIPANTS 
LANGUAGE ARTS PEDAGOGICAL KNOWLEDGE ASSESSMENT (LAPKA) 

(Number of Teachers = 42) 

Descriptive Characteristics of Participants        Number 

Gender 
   Male                                               4 
   Female                                            38 

Ethnicity 
   Asian                                              1 
   Black                                              5 
   Hispanic                                           4 
   Native American                                    0 
   White                                             32 

Grade Level 
   K-3                                               21 
   4-7                                               21 

Teacher Training Program 
   In California                                     33 
   Outside of California                              8 
   No Response                                        1 

Number of Reading Methods Courses 
   0-1                                               17 
   2 or more                                         25 





Each pilot test session was conducted in a four-and-one-half hour session by one or 
two test administrators. Each administration included an overview, two fifteen-minute 
breaks, and three of the four exercises (called scenarios). The teachers at the first two pilot 
test sessions completed Scenarios 1A, 2, and 3; the teachers at the third session completed 
Scenarios 1B, 2, and 3. (It should be noted again that 1A and 1B were two versions of the 
same type of exercise.) 

Logistics 

Administration required the following logistical activities: identifying a sample of 
teachers, sending orientation materials to teachers, administering the assessment, and 
acquiring feedback from the teachers. 

Identifying teacher samples. In recruiting elementary school teachers for the LAPKA 
pilot test, FWL aimed to identify a diverse group of teachers from a variety of settings. At 
the same time, for the pilot test administration to be logistically feasible and cost effective, 
we needed to identify administration sites that could be reached by car or public 
transportation by a sufficiently large number of beginning teachers in an hour or less. With 
these conditions in mind, we contacted a number of project directors from the California 
New Teacher Project in Southern California and the Bay Area. These project directors 
supplied FWL with a list of names and school sites of first- and second-year teachers in 
their project. FWL contacted these teachers by phone to ask for their participation. All 
teachers were offered $80 to participate in the pilot test. 

Sending orientation materials. The orientation materials sent to teachers for this 
assessment included a two-page overview which described the content and format of the 
assessment, a brief description of the California New Teacher Project and its Assessment 
component, and directions to the administration site. 

Assessment administration. Although no special training is needed for the 
administrator(s) of this assessment, the role of the administrator(s) in this pilot test was a 
key one requiring the following administrative activities: First, test materials were 
distributed at the beginning of the session in three separate manila envelopes, one envelope 
for each scenario. Each envelope contained the following materials: (a) instructions to 
viewers, (b) support materials (e.g., story or chapter of a book), and (c) several sheets of 
questions. The test administrator then instructed the candidates to remove and read the 
sheet(s) of questions pertaining to the video segment about to be viewed. After the 
candidates had a chance to read the questions, the administrator showed the segment. The 
administrator then turned off the videotape and instructed the candidates to answer the 
questions pertaining to the segment. When the allotted time for answering the questions 
expired, the administrator instructed the candidates to read the next sheet(s) of questions 
pertaining to the next video segment. The administrator then showed the next video 
segment and repeated the process described above. Each of the three video exercises was 
administered in this manner. 

The three separate administration sessions were conducted in average-sized 
classrooms or staff rooms with groups ranging in size from twelve to twenty-five. For the 
two smaller groups, a single video monitor was adequate; for the larger group, two video 
monitors were necessary to ensure that all of the candidates could clearly see and hear the 
tape. LAPKA could potentially be held in a large conference room with several hundred 
candidates taking the assessment at the same time, or be carried over closed-circuit 
television and shown in a number of rooms at the same time. The only requirements would 
be to have a video monitor for approximately every twenty candidates and enough staff on 
site to monitor candidates during the assessment. 

Collecting evaluation feedback. Immediately after viewing the videotapes and 
answering the questions, the teachers were asked to complete an evaluation feedback form 
in which they gave their thoughts and opinions about the assessment. 

Security 

For security purposes, the format and focus of LAPKA, as well as many of the actual 
questions, could remain unchanged with each new administration, but there would need to 
be a change in the content of the assessment each time. Also, while the overall scoring 
system could remain the same, new scoring criteria would need to be developed for any 
changes in the content. Thus, some development costs for LAPKA would be ongoing, but 
these costs might be significantly diminished on a per candidate basis if the assessment is 
administered simultaneously to a number of large groups of candidates. 

Another security-related issue is the influence of coaching. The coaching that might 
take place, however, would likely contribute to the teachers' professional development. For 
example, two typical questions in LAPKA ask the teacher to (1) identify two effective 
features of the teacher's language arts instruction, and (2) identify one important way that 
the teacher's instruction could have been enhanced. To practice for this assessment, 
teachers might join together to view videotapes of each other's language arts lessons and 
discuss them using the above two questions as focal points. This kind of "coaching" would 
likely improve not only teachers' performance on LAPKA, but their language arts practices 
as well. To prepare for this assessment, candidates might also memorize a "generic" list of 
effective language arts practices (e.g., activate students' background knowledge before 
reading), but learning such a list without developing an understanding of the principles 
from which the items on the list are derived is not likely to significantly improve teachers' 
scores on LAPKA (or their actual teaching). Security would be addressed through the use of 
new video taped segments and scoring criteria for them. 

Scoring 

The following section includes an explanation of the scoring process, a description of 
the training of scorers, and a discussion of scorers' perceptions, as well as those of FWL 
staff, of the training. 

Scoring Process 

As mentioned earlier, the candidates' responses to the assessment questions are 
scored by comparing them to the responses listed on a scoring key provided by the 
assessment developer. The scoring key lists both acceptable and unacceptable responses for 
each question, and also stipulates the number of points (i.e., zero, one or two) to be awarded 
for each response, as well as the maximum allowable number of points for any single 
question. 

The number of possible acceptable responses described in the scoring key ranges 
from two to ten for each question. For example, in the scoring key for Scenario 
1A/Question 1, the candidate is asked to "describe at least three important ways the teacher 
prepared the children for reading this story." The scoring key lists five acceptable responses 
for this question, allows one point for each correct answer, and sets the maximum number 
of points to be awarded at three. The acceptable responses listed in the scoring key for 
most of the questions are intended by the developer to be exhaustive; that is, any response 
by a candidate that does not correspond to one of the items on the scoring key is not 
awarded credit. 
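
The scoring rule described above (points per keyed response, capped at a maximum for the question, with unlisted responses earning no credit) can be sketched as follows. The response labels and function names are hypothetical; only the structure of Scenario 1A/Question 1 (five acceptable responses, one point each, maximum of three) is taken from the description of the scoring key.

```python
def score_question(matched_responses, key_points, max_points):
    """Score one question against a scoring key.

    matched_responses: set of labels the scorer judged the candidate to have given.
    key_points: dict mapping each acceptable response label to its point value.
    max_points: maximum number of points allowed for the question.
    Responses not on the key earn no credit.
    """
    earned = sum(key_points[label]
                 for label in matched_responses if label in key_points)
    return min(earned, max_points)

def score_scenario(question_scores):
    """Sum question-level scores to give the scenario total."""
    return sum(question_scores)

# Hypothetical key shaped like Scenario 1A/Question 1: five acceptable
# responses worth one point each, with a maximum of three points.
question_key = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 1}
capped = score_question({"a", "b", "c", "d"}, question_key, 3)        # 3
partial = score_question({"a", "not_on_key"}, question_key, 3)       # 1
```

Under this rule a candidate who gives four keyed responses receives the same three points as one who gives three, and a response absent from the key adds nothing.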

The scoring key also provides examples of responses that should not be credited. In 
Scenario 1A/Question 1, three "no credit" responses are listed. For example, no points are 
awarded to a candidate who points out that the teacher "uses the table of contents to locate 
the story." The listed unacceptable responses are intended only as examples, but the most 
commonly expected responses that would not receive credit are included in the key. 




For a few questions, the scoring key presents broad guidelines rather than specific 
examples of acceptable responses. In these cases, the scorer is expected to rely more heavily 
on his/her professional judgement than on the scoring key. The scoring manual also does 
not provide an explicit rationale why some responses are acceptable and others are not, nor 
why some responses are awarded one point and others two points. 

The teachers' responses to the assessment questions were all scored during the two- 
day training period for the scorers. This training session is discussed below. 

Scorers and Their Training 

The training of scorers and the scoring of the LAPKA pilot test responses took place 
at FWL over a two-day period, on June 26 and 27, 1990. The training and scoring session 
was directed by two staff members from NWREL, who were also members of the LAPKA 
development team. Six scorers participated in the two-day session. 

Scorer characteristics. The six scorers were all current or former teachers. Four 
were practicing elementary school teachers, ranging in experience from six to thirty years. 
One of the scorers was a former elementary school teacher and principal, and one was a 
former secondary English teacher who was a graduate research assistant with Stanford 
University's Teacher Assessment Project in elementary language arts instruction. All six 
scorers were female; one was African-American, and another was Asian-American. 

Training. The two-day training session was roughly divided into four half-day 
sessions, with each session devoted to the training of scorers and the actual scoring of 
candidate responses for one of the four scenarios. The trainers opened with a brief 
overview of LAPKA and then proceeded with training and scoring each scenario in order 
from Scenario 1A through Scenario 3. 

The procedure for training was as follows: One of the trainers began by presenting a 
brief overview of the scenario. The scorers then read relevant materials, such as the 
literature selection and instructions to the viewers, previewed the assessment questions for 
the upcoming segment of the videotape, and viewed a segment of the scenario. The trainers 
explained the scoring key for that segment of the exercise, then the six scorers individually 
scored the same two candidate responses for that segment and discussed any scoring-related 
issues. After the segment-by-segment training session for the scenario was completed, pairs 
of scorers independently scored sets of candidate responses for that scenario. 






Because Scenarios 1A and 1B were each completed by only half of the pilot test 
candidates (1A by one half of the candidates, and 1B by the other half), they required 
slightly less time to score than Scenarios 2 and 3. Each scenario, with the exception of 
Scenario 3, was double scored. Scenario 3 was scored by a single scorer because of time 
constraints. Thus, interrater reliability scores were only available for Scenarios 1A, 1B, and 
2. Each pair of scorers was assigned a subset (one-third) of candidate responses to score. 
The two scorers in each pair both scored the same candidate responses but did so 
independently. The three scorer pairs were reassigned so that each scorer was paired with 
a different partner for each scenario. 
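
The re-pairing described above can be sketched in Python. The report does not say how partners were chosen, so the circle (round-robin) method used here is an assumption, as are the scorer labels S1 to S6 and the function name rotate_pairs; any scheme that gives each scorer a new partner for each double-scored scenario would serve equally well.

```python
def rotate_pairs(scorers, rounds):
    """Return one list of scorer pairs per round, with no partnership repeated.

    Uses the circle method: the first scorer stays fixed and the rest rotate.
    """
    n = len(scorers)
    if n % 2 != 0:
        raise ValueError("need an even number of scorers")
    if rounds > n - 1:
        raise ValueError("not enough distinct partners for that many rounds")
    order = list(scorers)
    schedule = []
    for _ in range(rounds):
        # Pair the i-th scorer from the front with the i-th from the back.
        schedule.append([(order[i], order[n - 1 - i]) for i in range(n // 2)])
        # Keep position 0 fixed; move the last scorer into position 1.
        order = [order[0], order[-1]] + order[1:-1]
    return schedule


scorers = ["S1", "S2", "S3", "S4", "S5", "S6"]  # hypothetical labels
double_scored = ["1A", "1B", "2"]  # Scenario 3 was scored by a single scorer
schedule = dict(zip(double_scored, rotate_pairs(scorers, len(double_scored))))
for scenario, pairs in schedule.items():
    print(scenario, pairs)
```

Each of the three pairs in a round would then take one-third of that scenario's candidate responses.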

To help facilitate the scoring process, the trainers provided scoring sheets for the 
scorers to use in marking their scores. These sheets listed the individual scoring criteria 
down the right side of the page and blank lines on the left side of the page to indicate 
whether the candidate received credit for a particular criterion (see Appendix D for an 
example of a scoring sheet). 

Perceptions of training. The scorers' perceptions of the training they received were 
mixed: two rated the training as very good, two as adequate, and two as poor. The scorers 
offered several suggestions for improving the training. One of their major concerns was 
that the training was too brief. They thought that they were asked to score the pilot test 
responses before they had adequately reached a shared understanding of (1) the scoring 
criteria and (2) how narrowly or broadly to apply the criteria to candidate responses. The 
low interrater reliability scores (discussed later in the Technical Quality section) suggest 
that scorers were not interpreting the criteria in a consistent fashion. The scorers 
recommended that more time be allotted for training, particularly for working with practice 
materials. 

In addition, rather than viewing the scenarios segment-by-segment, the scorers 
thought it would have been more valuable to first watch each scenario from beginning to 
end without interruptions before attempting to score any candidate responses. This 
procedure, they said, would have helped them develop a better overall perspective for 
scoring the exercise. 

Finally, the scorers also felt that they were being asked to narrowly and 
mechanistically apply the scoring criteria, when they should have been given more latitude 
to interpret candidate responses and apply their professional judgment. This issue was very 
important to the scorers and is discussed more extensively in the format section of this 
report. 

FWL staff agree with the scorers' recommendations that (1) the training be 
lengthened, and (2) the training should provide scorers with an overall understanding of the 
entire scenario and accompanying questions before any scoring is attempted. Other 
comments pertinent to any future training sessions can be found in the format section. 

Teacher and FWL Staff Impressions of Administration 

FWL staff members administered LAPKA on three separate occasions. No 
significant problems arose at any of the sessions. While a few of the students' comments on 
the videotape were inaudible (from any distance), none of the teachers reported that this 
problem interfered with his or her ability to respond to the exercise questions. Many of the 
teachers did, however, report fatigue from the four-and-a-half hours of assessment. 

Assessment Content 

The content of the LAPKA scenarios focuses on assessing a teacher's pedagogical 
content knowledge in the following areas: 

(a) reading; 

(b) writing; and 

(c) response to literature. 

As described earlier, the assessment asks candidates to view videotaped segments of 
language arts lessons which focus on each of the above areas, and then to comment on the 
pedagogical practices which are depicted. After viewing each scenario, the candidates are 
asked a variety of questions to elicit their knowledge about (1) the pedagogical content 
method(s) used by the teacher in the videotape, (2) the rationale for, or the effectiveness of, 
these method(s), and (3) other pedagogical content methods which could be used instead of 
or in addition to those depicted. 

While the LAPKA scenarios do not cover the entire range of instructional activities 
carried out by teachers in language arts, they represent a diverse set of activities that are 
essential to any successful language arts program. In addition, these scenarios cover 
instructional situations across a broad span of grades and student ability levels. It should 
be noted, however, that while the three areas listed above are generally acknowledged to be 
major components of a language arts program, the developers did not intend for the content 
of the assessment to provide a representative assessment in language arts. Instead, they 
chose these three areas as a focus for an alternative assessment approach, i.e., the use of 
videotapes as a stimulus for written responses. 

In the following pages, the content of LAPKA is evaluated along the following 
dimensions: 

• Congruence with the California English-Language Arts Model 
Curriculum Guide for Kindergarten through Grade Eight; 

• Extent of coverage of California Standards for Beginning Teachers; 

• Job-relatedness of the instrument; 

• Appropriateness for beginning teachers; 

• Appropriateness across different teaching contexts (e.g., grade level, 
diverse student groups); 

• Fairness across groups of teachers (e.g., ethnic groups, gender); and 

• Appropriateness as a method of assessment. 

Congruence with California Curriculum Guides and Frameworks 

FWL staff reviewed the content of the LAPKA assessment to see in what ways it is 
congruent with California's English-Language Arts Model Curriculum Guide for 
Kindergarten through Grade Eight (SDE, 1987). 

The guide presents recommendations for an effective English-language arts program 
in the form of twenty-two statements, referred to as guidelines. These guidelines are 
categorized in five major groupings: (1) the reading and studying of significant literary 
works, (2) classroom instruction based on students' experiences, (3) integration of the 
language arts, (4) integration of English-language arts with other subject matter areas and 
settings outside the classroom, and (5) student evaluation. Within each of the five major 
groupings are several guidelines which focus on specific features of an effective language 
arts program. The following section examines the congruence of LAPKA with these general 
groupings and guidelines. 

Grouping 1: The English-language arts program emphasizes the reading and the 
study of significant literary works. The three guidelines within this grouping stress the 
importance of providing intensive, direct instruction for all students in comprehending and 
responding to core works of literature, creating opportunities for students to explore and 
extend their experiences with literature, and supporting and encouraging students to read 
independently. 

Literature is emphasized in three of the four LAPKA scenarios (i.e., 1A, 1B, and 3). 
Instruction in these scenarios revolves around a literary work such as a children's picture 
book by a well-known author or a children's historical novel which is on the state's list of 
recommended literature (SDE, 1986). The candidate's task is to describe and evaluate the 
teachers' approach to instructing the students in comprehending and responding to the 
works of literature. The candidates are also asked to offer suggestions of other ways to 
provide such instruction. Thus, the emphasis on literature in LAPKA is strongly congruent 
with the curriculum guide. 

Grouping 2: The English-language arts program includes classroom instruction based 
on students' experiences. The two guidelines within this grouping deal with the importance 
of having students draw on their experiences while participating in language arts activities, 
and of students participating in activities designed to give them experience and knowledge 
needed to be proficient in the language arts. 

The teachers in the videotaped scenarios are shown providing instruction that draws 
upon their students' background knowledge and experiences. The candidates being assessed 
are expected to recognize this instructional practice as well as to suggest alternatives or 
ways to improve upon what the teachers actually did. The focus in LAPKA on basing 
instruction on students' knowledge and experiences is strongly congruent with the 
standards in this grouping. 

Grouping 3: English-language arts instruction is based on an interrelated program in 
which listening, speaking, reading, and writing, with literature at the core, are taught in 
concert and are mutually reinforcing. The eight guidelines in this grouping focus on the 
development and integration of speaking, listening, reading, and writing skills and 
strategies. This group of guidelines also addresses the teaching of decoding strategies, the 
conventions of the English language (e.g., spelling, punctuation), and handwriting skills. 



In the four videotaped LAPKA scenarios, the teachers engage their students in a 
variety of reading, writing, speaking, and listening activities. Candidates being assessed are 
asked to describe and evaluate the teachers' instructional practices and to offer alternative 
ways that they might approach similar teaching tasks. LAPKA's focus on the development 
and integration of the language arts in these scenarios is congruent with most of the 
guidelines in this grouping. LAPKA does not, however, address the teaching of decoding 
strategies, language conventions, or handwriting skills. 

Grouping 4: English-language arts are an integral part of the entire curriculum. The 
seven guidelines in this grouping focus on the connection between English-language arts 
skills and other subject matter areas. Some of the topics covered are the use of higher-order 
thinking skills in English-language arts and other subject areas, broadening students' 
vocabulary, using the library and other media and technological resources, modeling of 
communication skills by school staff members, and involvement of parents in the 
educational program. 

Most of these guidelines fall outside of the focus of LAPKA. LAPKA does not, for 
instance, deal with the integration of the language arts in other subject areas or outside the 
classroom. Scenario 1B does, however, present the teacher weaving another content area 
(e.g., math) into the reading lesson. For Scenarios 1A and 1B, LAPKA also asks candidates 
to identify important features of the videotaped teachers' vocabulary instruction, and for 
Scenario 3, candidates are asked to discuss the teacher's goal of encouraging students to 
think in new directions (i.e., higher-order thinking). Thus, congruence of LAPKA with this 
grouping should perhaps best be described as partial. 

Grouping 5: Evaluation of the English-language arts program includes a broad range 
of assessment methods. The two guidelines in this grouping stress the importance of multi- 
dimensional measures of assessment and student self-assessment. Scenario 2 addresses 
teachers' evaluations of student writing with some self-assessment by students included; the 
other scenarios do not address teacher assessment of students, student self-assessment, or 
program assessment. Congruence with this grouping therefore is weak. 

Table 6.3 summarizes the congruence of LAPKA with the English-Language Arts 
Model Curriculum Guide for Kindergarten through Grade Eight. Overall, LAPKA is 
strongly congruent with the guide in some areas and partially congruent in others. 




TABLE 6.3 

CONGRUENCE OF THE LANGUAGE ARTS PEDAGOGICAL KNOWLEDGE 
ASSESSMENT (LAPKA) WITH THE ENGLISH-LANGUAGE ARTS MODEL CURRICULUM 
GUIDE FOR KINDERGARTEN THROUGH GRADE EIGHT 

Model Curriculum Guide Content            Relevant LAPKA Content              Extent of
                                                                              Congruence

Grouping 1: The English-language arts     Study of literature is central      Strong
program emphasizes the reading and        to 3 of 4 scenarios.
study of significant literary works.

Grouping 2: The English-language arts     Role of student background          Strong
program includes classroom instruction    knowledge is emphasized in all
based on students' experiences.           scenarios.

Grouping 3: English-language arts         Instruction portrayed in the        Partial
instruction is based on an interrelated   scenarios is based on
program in which listening, speaking,     integration of language arts.
reading, and writing, with literature at  The scenarios do not address
the core, are taught in concert and are   decoding strategies, language
mutually reinforcing.                     conventions, or handwriting
                                          skills.

Grouping 4: English-language arts are     The scenarios do not address        Partial
an integral part of the entire            language arts instruction
curriculum.                               outside of the classroom.
                                          Scenario 1B addresses
                                          integration of language arts
                                          with other subject matter areas.
                                          Scenario 3 addresses critical
                                          thinking. Scenarios 1A and 1B
                                          address vocabulary
                                          development.

Grouping 5: Evaluation of the             Scenario 2 addresses teacher        Weak
English-language arts program includes    evaluation of student writing.
a broad range of assessment methods.      The other scenarios do not
                                          address evaluation.






Extent of Coverage of California Standards for Beginning Teachers 

The California Beginning Teacher Standards are criteria for teacher competence and 
performance that the Commission on Teacher Credentialing expects graduates of California 
teacher preparation programs to meet. Listed below are brief italicized descriptions of the 
Standards (22 - 32) that pertain to expectations of student competencies to be attained prior 
to graduation from teacher preparation programs. (Standards 1 through 21 address 
programmatic requirements.) To evaluate this assessment instrument and make inferences 
about the assessment approach which it represents in terms of its appropriateness for use 
with California elementary teachers, the stimulus materials and scoring criteria for each 
exercise were compared with the eleven relevant California Beginning Teacher Standards. 
Each standard is discussed separately. 

Given that these standards are intended to guide the evaluation of teachers' 
performance in the classroom, an assessment such as LAPKA only indirectly addresses these 
standards. LAPKA measures teacher knowledge, but it does not provide direct evidence of 
teachers' ability to translate that knowledge into actual practice. 

Standard 22: Student Rapport and Classroom Environment. Each candidate 
establishes and sustains a level of student rapport and a classroom environment that 
promotes learning and equity, and that fosters mutual respect among the persons in a class. 
LAPKA does not address this standard. The teachers in the videotapes vary in their 
approaches to creating a positive and productive classroom environment, but the candidates 
are not asked to comment on this feature of instruction. 

Standard 23: Curricular and Instructional Planning Skills. Each candidate prepares 
at least one unit plan and several lesson plans that include goals, objectives, strategies, 
activities, materials and assessment plans that are well defined and coordinated with each 
other. While the candidates in this assessment do not plan lessons of their own, this 
standard is addressed in a limited fashion in LAPKA. In Scenarios 1A, IB, and 3, 
candidates are asked to evaluate the features of another teacher's lesson and suggest 
alternatives and extensions to these lessons, which provides some indirect evidence of the 
quality of lessons that candidates might be capable of designing and carrying out in their 
own classrooms. 






Standard 24: Diverse and Appropriate Teaching. Each candidate prepares and uses 
instructional strategies, activities and materials that are appropriate for students with 
diverse needs, interests and learning styles. LAPKA does not provide even indirect evidence 
of a teacher's abilities in this area. While a variety of instructional strategies are portrayed 
on the videotapes, candidates are not asked to comment on the appropriateness of these 
various strategies for meeting the diverse needs, interests, and learning styles of students. 

Standard 25: Student Motivation, Involvement and Conduct. Each candidate 
motivates and sustains student interest, involvement and appropriate conduct equitably 
during a variety of class activities. LAPKA addresses this standard in a limited way. One 
of the candidate's tasks is to identify effective features of the various instructional activities 
portrayed in the four different scenarios. The candidate's comments on the ways that the 
videotaped teacher successfully motivates students, such as through pre-reading activities 
(e.g., Scenario 1A/Question 1), or suggestions for motivating students, such as through 
pre-writing activities (e.g., Scenario [illegible]/Question [illegible]), could provide some 
indirect evidence of a candidate's ability to motivate and sustain student interest and 
participation. 

Standard 26: Presentation Skills. Each candidate communicates effectively by 
presenting ideas and instructions clearly and meaningfully to students. This standard, 
which focuses on teachers' presentation and communication skills in the classroom, is not 
addressed by LAPKA. 

Standard 27: Student Diagnosis, Achievement and Evaluation. Each candidate 
identifies students' prior attainments, achieves significant instructional objectives, and 
evaluates the achievements of the students in a class. LAPKA does not provide evidence 
about a candidate's ability to achieve his or her instructional objectives or to assess the 
achievements of students in a class. 

Standard 28: Cognitive Outcomes of Teaching. Each candidate improves the ability 
of students in a class to evaluate information, think analytically, and reach sound 
conclusions. This standard is addressed to a limited degree by Scenario 3. The videotaped 
teacher presents a literature-based lesson in which one of her goals is to develop students' 
decision-making and critical thinking abilities. The candidate's task for several of the 
questions is to discuss why this is an important goal for a language arts activity. Thus, a 
candidate's response to these questions provides some evidence of his or her knowledge of 
ways to improve the thinking ability of students. 






Standard 29: Affective Outcomes of Teaching. Each candidate fosters positive 
student attitudes toward the subjects learned, the students themselves, and their capacity to 
become independent learners. A few questions in LAPKA address this standard in a limited 
way. For example, in one of the individual writing conferences in Scenario 2, a teacher 
criticizes a student's writing at great length, and the student begins to visibly retreat by 
folding his arms and moving back. A candidate's comments about this encounter might 
provide some information about his or her sensitivity to the affective dimension of 
teaching. 

Standard 30: Capacity to Teach Cross-culturally. Each candidate demonstrates 
compatibility with, and ability to teach, students who are different from the candidate. The 
differences between students and the candidate should include ethnic, cultural, gender, 
linguistic and socioeconomic differences. LAPKA does not address this standard. While the 
students portrayed in these scenarios come from a range of socioeconomic, linguistic, and 
cultural backgrounds, the assessment does not provide any evidence about a candidate's 
capacity to teach students who are different from the candidate. 

Standard 31: Readiness for Diverse Responsibilities. Each candidate teaches 
students of diverse ages and abilities, and assumes the responsibilities of full-time teachers. 
This standard is partially addressed in LAPKA. The four videotaped scenarios in LAPKA 
cover the span from elementary to middle school ages, and include a range of individual and 
group ability levels. The teachers in these scenarios present diverse instructional strategies 
for promoting the reading, writing, and oral language development of students of various 
grades and abilities. Scenarios 1A and 1B present a direct instruction lesson in reading for 
first and second graders; Scenario 2 focuses on individual writing conferences across several 
grade levels; and Scenario 3 portrays a literature-based cooperative group activity in a 
fourth/fifth grade class. A candidate's comments about the effective and ineffective features 
of the instruction shown in these scenarios, along with his or her suggested extension 
activities, offer some indirect evidence of a candidate's potential to provide instruction that 
takes into account the needs of students of different ages and abilities. 

Standard 32: Professional Obligations. Each candidate adheres to high 
standards of professional conduct, cooperates effectively with other adults in the school 
community, and develops professionally through self-assessment and collegial interaction 
with other members of the profession. This standard is not addressed in LAPKA. 






The extent of coverage by LAPKA of the California Beginning Teacher Standards is 
summarized in Table 6.4. The table lists the LAPKA scenarios that address each standard, 
and also describes the extent of coverage provided. 

Job-Relatedness 

The teacher candidates who took the assessment and the scorers who evaluated the 
candidates' responses strongly agreed (89%) that the pedagogical content knowledge 
assessed by LAPKA is relevant to the job of teaching elementary language arts. However, 
several of these teachers questioned the reality of reduced class sizes and individual writing 
conferences portrayed in the videotapes: 

The teacher-led instruction had only ten kids. Get real! 

No teacher has time for one-on-one writing instruction. 

It is an unrealistic scenario [i.e., individual writing conferences] for a 
teacher responsible for 30+ kids. 

FWL staff agrees that the pedagogical content knowledge assessed by LAPKA is 
related to the job of teaching elementary language arts. As for the teachers' comments 
about some of the scenarios not being relevant to a "real" teacher's job, FWL acknowledges 
that extended individual writing conferences, for example, are probably difficult to conduct 
in most teachers' classrooms; however, the areas of pedagogical content knowledge (e.g., 
writing, reading) assessed by the scenarios are relevant to any job of an elementary teacher 
of language arts, and are not strictly related to the size of the groups depicted in the 
scenarios. 

Appropriateness for Beginning Teachers 

The discussion in this section focuses on the teachers' perceptions of the 
appropriateness of LAPKA for beginning teachers and their performance on the assessment. 

Perceptions. Most (78%) of the candidates reported that LAPKA was of appropriate 
complexity for assessing new teachers and that they have had sufficient opportunity to 
acquire the knowledge and skills necessary to respond to the assessment questions. 






TABLE 6.4 

EXTENT OF COVERAGE BY THE LANGUAGE ARTS PEDAGOGICAL KNOWLEDGE 
ASSESSMENT (LAPKA) OF THE CALIFORNIA STANDARDS FOR BEGINNING TEACHERS 

                                                    LAPKA Scenarios
Standard                                            Addressing Standards        Coverage

22: Student Rapport and Classroom Environment       None                        None
23: Curricular and Instructional Planning Skills    Scenarios 1A, 1B, and 3     Partial
24: Diverse and Appropriate Teaching                None                        None
25: Student Motivation, Involvement and Conduct     Scenarios 1A, 1B, 2 and 3   Limited
26: Presentation Skills                             None                        None
27: Student Diagnosis, Achievement and Evaluation   None                        None
28: Cognitive Outcomes of Teaching                  Scenario 3                  Partial
29: Affective Outcomes of Teaching                  Scenarios 2 and 3           Limited
30: Capacity to Teach Crossculturally               None                        None
31: Readiness for Diverse Responsibilities          Scenarios 1A, 1B, 2 and 3   Partial
32: Professional Obligations                        None                        None






While the majority of the teachers felt prepared to take this assessment, several of 
them emphasized that most of what LAPKA taps is learned on the job after they have 
completed their education courses and student teaching: 

I would have not been able to answer these questions just based on 
coursework and student teaching. Much of what was assessed was 
learned during my first year, not before. 

I think you'd need to teach before taking this. Books or class 
knowledge doesn't necessarily give you the answers. 

I think all [of this assessment] should be given after experiencing a real 
classroom on your own. It's very different than student teaching. 

One teacher felt that a lack of experience accounted for her difficulty in answering 
some of the "extension" questions: 

I felt slightly inadequate to continue coming up with enhancement 
ideas. It seems that experience plays an important role in creating a 
variety of ideas and methods to be used in instruction. 

A few teachers reported that they did not feel prepared for almost any of the 
assessment: 

A lot of things I have never learned, either in the education program or 
in the new teacher program. My language arts in college taught me 
hardly any of this. 

Other teachers felt unprepared for certain sections of the assessment, particularly 
those that dealt with a grade level they had not taught or a subject area (e.g., writing 
instruction) with which they had limited experience. 

While the developers of LAPKA initially planned to portray only good teaching 
practices, the scenarios that were produced did not always achieve this goal. In response to 
those scenarios that did depict exemplary teaching, however, many of the teachers, in 
written and oral feedback, expressed excitement about the teaching practices portrayed, as 
well as an eagerness to try out similar methods in their own classrooms. One teacher 
commented: 






I thought [the assessment] was very enjoyable. I think teachers ought 
to be able to see other teachers teaching model lessons in order to gain 
insights into their own teaching strategies. 

Given that this assessment is aimed at beginning teachers, consideration should be 
given to the question of whether less-than-exemplary teaching practices should be 
presented. On the one hand, as indicated by the teachers' comments above, scenarios that 
depict exemplary teaching practices offer the instructional benefit of affording teachers the 
opportunity to learn how to improve their teaching. On the other hand, one strength of 
videotapes is their capacity to show those things, both positive and negative, that a 
paper-and-pencil assessment can only describe (i.e., a picture is worth a thousand words). For 
example, LAPKA's scenario of a young student becoming very discouraged during a writing 
conference with his teacher demonstrates in a much clearer and more powerful way that a 
student's self-esteem can be immensely affected by a teacher than could ever be 
demonstrated in a short written description. Thus, scenarios of less-than-exemplary 
teaching practices could be used to assess a candidate's knowledge of incorrect or 
undesirable teaching practices. There is the risk, however, that the teaching practices 
portrayed would not be recognized as undesirable, simply because they are part of a state 
assessment. The questions accompanying such a scenario would have to be carefully 
designed so as to avoid this possibility. 

Performance on assessment. FWL staff analyzed the teachers' overall performances 
as well as their scores on each of the four individual scenarios to see if the beginning 
teachers participating in this assessment had acquired the knowledge and skills measured 
by LAPKA. For Scenarios 1A, 1B, and 2, candidate scores represent the average of two 
independent ratings, while the scores for Scenario 3 are the ratings given by a single scorer. 
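
A small Python sketch of how two independent ratings might be combined into a candidate score. The per-criterion credits below are hypothetical, the function names are illustrative, and the agreement rate is only a simple consistency check, not the reliability statistic used in this report.

```python
def candidate_score(ratings):
    """Average the independent ratings for one candidate (one or two values)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(ratings) / len(ratings)


def agreement_rate(credits_a, credits_b):
    """Share of scoring criteria on which two scorers gave the same credit."""
    if len(credits_a) != len(credits_b) or not credits_a:
        raise ValueError("both scorers must rate the same non-empty criteria")
    matches = sum(1 for a, b in zip(credits_a, credits_b) if a == b)
    return matches / len(credits_a)


# Hypothetical credits (1 = credit given, 0 = no credit) on four criteria.
scorer_a = [1, 0, 1, 1]
scorer_b = [1, 1, 1, 0]

score = candidate_score([sum(scorer_a), sum(scorer_b)])
agreement = agreement_rate(scorer_a, scorer_b)
print(score, agreement)
```

In this made-up case the two totals match even though the scorers agree on only half of the criteria, which illustrates why averaged scores alone say little about how consistently the criteria were applied.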

The total number of points possible, the means, standard deviations, and ranges for 
each individual scenario are reported in Table 6.5. The candidates' average scores on the 
scenarios ranged from 57% to 87% correct, suggesting that the assessment is probably of 
appropriate complexity for beginning teachers. Making a few changes in the assessment 
format, such as eliminating or rewording the few ambiguous questions and more clearly 
marking the spaces for candidate responses, particularly in Scenario 2, would possibly result 
in higher scores. 

TABLE 6.5 

PERFORMANCE DATA FOR PILOT TEST TEACHERS (N=42) FOR THE 
LANGUAGE ARTS PEDAGOGICAL KNOWLEDGE ASSESSMENT (LAPKA) 

                 Total
                 Possible                          Standard
Scenario         Points      Mean      Percent     Deviation    Range

Scenario 1A      13           8.80     68%         1.97         2-11
Scenario 1B       4           3.47     87%         0.66         2-4
Scenario 2       39          23.38     57%         6.03         11-37
Scenario 3       35          24.14     69%         5.00         12-33

The candidates had the most difficulty with Scenario 2, the individual writing 
conferences, correctly answering only 57% of the questions. The candidates scored the 
highest on Scenario 1B, a direct instruction reading lesson, getting 87% correct. However, 
on Scenario 1A, which is another version of a direct instruction reading lesson, the 
candidates scored only 68%. While the difference in the average candidate scores for 
Scenarios 1A and 1B suggests that these two scenarios may have been tapping different 
areas of candidate knowledge, the small number of questions in Scenario 1B (4 points) 
makes any comparison between these scenarios tenuous. 
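
The percent figures in Table 6.5 appear to be the mean score divided by the total possible points. The following Python sketch works through that arithmetic. The formula is an assumption (the report does not state how the percentages were derived), and the names TABLE_6_5 and percent_correct are illustrative only.

```python
# Values transcribed from Table 6.5: (total possible points, mean score).
TABLE_6_5 = {
    "1A": (13, 8.80),
    "1B": (4, 3.47),
    "2": (39, 23.38),
    "3": (35, 24.14),
}


def percent_correct(total_possible, mean_score):
    """Mean score as a whole-number percentage of the points available.

    Assumed formula; the report does not give its own calculation.
    """
    return round(100 * mean_score / total_possible)


percents = {
    scenario: percent_correct(total, mean)
    for scenario, (total, mean) in TABLE_6_5.items()
}
for scenario, pct in percents.items():
    print(f"Scenario {scenario}: {pct}%")
```

Under this assumption, Scenarios 1A, 1B, and 3 reproduce the reported 68%, 87%, and 69%. Scenario 2 comes out at about 60% rather than the reported 57%; the report does not explain the difference, so the Scenario 2 figures are best read with some caution.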

There was a wide range of scores for each scenario. For each scenario, several 
candidates obtained near perfect scores (or, in the case of Scenario 1B, perfect scores), 
while other candidates missed a large percentage of the questions. This outcome suggests 
that LAPKA may effectively discriminate between weaker and stronger candidates. 

The performance data indicated no notable differences in candidate performance 
based on gender, ethnicity, grade level taught, California or non-California teacher training 
program, or the number of reading methods courses taken. 

Appropriateness across Contexts 

The LAPKA scenarios portrayed language arts instruction in a variety of contexts: 
across grade levels, with a diverse group of students, across the language arts, and with a 
variety of instructional settings. The following sections look at the perceptions of the 
teachers regarding the appropriateness of the assessment across certain contexts, as well as 
the perceptions of Sharon Nelson-Barber, our consultant on cultural diversity. 

Grade level. The multiple subjects credential issued to elementary teachers spans 
the grades K-8. Thus, any assessment for an elementary teacher should in some way 
address that teacher's capacity to teach the grade levels covered by the credential. The 
LAPKA assessment includes scenarios depicting instruction by teachers to students of 
different grade levels, and assesses candidates on their knowledge about this instruction. 
Most of the teachers (68%) who participated in the pilot test of LAPKA believe that the 
assessment is appropriate for teachers of different grade levels. One teacher commented: 

You never know what grade level you'll be teaching; therefore, [you]
need to be accountable for all.

Diverse students. Most of the teacher candidates (63%) indicated that this 
assessment is appropriate for teachers of diverse student groups. Several of the teachers 
who disagreed gave the following comments: 

I saw very few if any blacks or Hispanics. The [videotape] didn't show
California ethnic groups. 

There could be some videos with LES or bilingual classroom settings.

[The videotape didn't show] classrooms where students are very 
disruptive or where reading ability [is] very low. 

The above comments are important ones, especially in light of the fact that one of 
the advantages of using videotapes is the ability to show a mixture of students in a
classroom, not just describe the students. Moreover, those students who usually present a
challenge to beginning teachers are those of limited-English proficiency or low reading
ability, and not the typical GATE students as shown in Scenario 3. Should the LAPKA
assessment be developed further, consideration might be given to including scenarios that 
depict students whose diversity is not only commonly represented in California classrooms, 
but also commonly presents a challenge to beginning teachers. 

Sharon Nelson-Barber, our consultant on cultural diversity, brings up another issue 
to be considered with regard to LAPKA's appropriateness across contexts. It is her belief 
that the LAPKA assessment is built around a conceptual framework that consists of "certain 
assumptions about how classrooms should be organized, how students should be rewarded, 
how talk should proceed, etc." These assumptions, according to Nelson-Barber, have "the 
potential to miss many of the instructional techniques and interactive behaviors deemed 
effective in some minority communities." As she comments, 

Teaching in multicultural classrooms requires going beyond the 
teaching of content to the relationship of that content to students'
broader contexts: their social environments, their communities, their
attitudes, even their feelings. 

For example, teachers of black students in predominantly black communities may 
respond to that environment by viewing the teaching of basic skills as essential to their 
students' survival in the mainstream community. They may put a focus on grammar,
punctuation, spelling, etc. as a way of moving their students towards mastery of the 
mainstream language. Similarly, teachers of classes of predominantly Asian students in 
predominantly Asian communities may respond to the high academic expectations of the
community by focusing on grading or by considering grading to be an important part of
their classroom gestalt.



According to the LAPKA assessment, however, it is likely that teachers in the above 
contexts would be penalized. For Scenario 2, Part A, Question 2, for example, if the 
candidate makes reference in his/her response to correcting grammatical, punctuation, 
spelling errors, etc., the candidate is to be awarded zero points. For Question 1, Part D, of 
the same scenario, the candidate is also awarded zero points if s/he emphasizes the 
importance of grading without making reference to the value of composing or organizing 
thinking. Although the LAPKA scenarios do not depict classes or groups largely composed 
of black students or Asian students, teachers who are used to such contexts may have a 
concept of teaching which differs from that of the assessment developers. 

Another example given by Nelson-Barber is that of teachers of students "whose only
prior interaction with text may have been holding a hymnal or observing the priest reading
in a foreign tongue." For these students, preparation for reading a story may necessitate
such basic preliminaries as asking the students to locate the story in the table of contents. 
The LAPKA assessment, however, does not recognize such preparation; if a teacher 
mentions directing students to the table of contents in his/her response to Question 1, 
Scenario 1A, zero points are awarded for the answer. 

It might be argued that, because the LAPKA scenarios do not show classrooms or 
groups of students as described by Nelson-Barber, none of the teachers' responses should 
resemble those suggested by Nelson-Barber. This argument, however, does not take into 
consideration the beginning teacher's experience and general approach to teaching. 
Teachers who are used to working with black students in predominantly black communities,
or with Asian students in predominantly Asian communities are likely to respond to the 
LAPKA questions with those contexts in mind. Although the answers they give may not be 
appropriate for the students in the videotape, should they be penalized for describing 
practices that research claims are effective for the students they teach? 

FWL staff suggests that consideration be given to incorporating into the assessment 
a way of ascertaining the teacher's philosophy of instruction and context of experience so
that his/her responses to the assessment questions can be judged within that framework. 
Without such a revision to the assessment, it is difficult to conclude that LAPKA is fair 
across contexts. 



Fairness across Groups of Teachers



While a high percentage (89%) of the teacher candidates felt that this assessment 
was fair to new teachers of both genders, different ethnic groups, and different language 
groups, this view was not shared by some of the teachers, the scorers, and our consultant on 
cultural diversity. Of major concern was the lack of teacher diversity shown on the 
videotapes. One of the scorers, an Asian-American female, summed up the problem as 
follows: 



Future LAPKA assessments need to include videotapes that show men 
as well as Hispanic and African-American teachers. Although the 
majority of teachers in California currently are Caucasian, given 
changes in the state's demography and the need for better minority 
representation in the teaching pool, assessment exercises used for 
certification should show the ethnic diversity of teachers that are found 
in the state's schools. 



Nelson-Barber, our consultant on cultural diversity, agreed that the videotapes
should have presented more diversity with respect to the teachers portrayed giving the 
lessons, but she also warned that presenting a diverse group of teachers will serve little 
purpose if the "diverse group of teachers demonstrates only one view of teaching." 

As was described in the section above, distinctly different approaches to teaching are 
often utilized by teachers of different student groups. Nelson-Barber points out that the 
teachers most likely to use these different approaches are teachers of the same racial or 
ethnic group as their students. Black teachers of black students, for example, are very likely 
to emphasize grammar, spelling, punctuation, etc. in their instruction (Delpit, 1986, 1988), a
teaching practice penalized by the LAPKA assessment (see the section above). Research by
Kleinfeld (1974) shows that the degree of "teacher-directedness" or the proximity of teacher 
to students can make real differences in the educational lives of American Indian students, 
who are accustomed to the "affectively intense and particularistic relationships characteristic 
of small traditional societies." Native teachers of American Indian students are most likely 
to view management and instruction as intimately tied, another practice penalized by the
LAPKA assessment (e.g., the script for Scenarios 1A and 1B explicitly asks candidates to
ignore the videotaped teacher's management practices; candidates are awarded zero points if 
they mention management issues or management-related activities in their answers to 
questions in Scenarios 1A and 2). 



Still another example supplied by Nelson-Barber is that of black teachers displaying 
a great deal of emotion with black students. Although an outside observer (or an 
uninformed assessor) may perceive such a teacher as "authoritarian, pushy, or harsh,"
Nelson-Barber cites research that shows that "for some members of the African-American
communities, teachers who do not exhibit these behaviors (i.e., genuine affective displays)
may be viewed as ineffectual, boring, and uncaring." Once again, however, the LAPKA
assessment penalizes this practice (i.e., a candidate is awarded zero points for mentioning 
"questions about feeling" in his/her response to Question 3, Scenario 1A). 

Thus, it is hard to judge the LAPKA assessment, as currently designed, as sensitive
to differences across groups of teachers.

Appropriateness as a Method of Assessment 

Teacher candidates and scorers largely agreed that LAPKA assesses knowledge 
relevant to elementary language arts instruction, but they expressed reservations about its 
appropriateness for assessing teachers' pedagogical competence. Their criticisms of LAPKA 
tended to fall into three categories: (1) LAPKA only measures teachers' ability to evaluate 
other teachers, (2) LAPKA measures what teachers say they do rather than what they 
actually do in the classroom, and (3) the true measure of good teaching is student learning, 
which LAPKA does not assess. Some of the teachers' comments are as follows: 

This hasn't given any thought to see if I as a teacher in my own class 
can take techniques and adapt to my own class or if I can be successful 
in teaching it. 

It assesses my ability to intellectually evaluate a lesson but does not 
evaluate my actual teaching.

Direct observation of teacher is better, or videotape the teacher. 

Success in teaching isn't measured on a piece of paper. The 
achievements and growth of students is success in teaching. 

The complaint by teachers that the LAPKA assessment does not evaluate "actual 
teaching" is a complaint often echoed by other teachers about other alternative assessments,
especially those that are not performance-based. Perhaps if teachers had been asked to 
consider the appropriateness of LAPKA as one of several methods of assessing new teachers,
their responses would have been different. Perhaps, too, if the teachers did not perceive the
assessment as asking them to evaluate other teachers' instructional practices, but rather as
asking them to demonstrate their knowledge of appropriate instructional practices in the
areas of reading, writing, and literature analysis, the teachers would have considered
LAPKA to be an appropriate method of assessment. It is the opinion of FWL staff,
however, that it is unlikely that the teachers' perceptions of this method of assessment
would change unless the content of the assessment were changed to better take advantage 
of the video stimulus. 

Comparison with other assessments. In addition to being asked the question of 
appropriateness as a method of assessment, all of the teachers were asked the following: 
"How does this assessment format (i.e., answering questions after viewing videotaped 
lessons) compare with others with which you have been evaluated (e.g., multiple-choice for 
CBEST and NTE Specialty Area Tests, classroom observations during student teaching) in
terms of its assessment ability?" While the teachers did not see LAPKA as appropriate as 
the sole measure of their teaching ability, they viewed it as a valuable supplement to direct 
observations and a significant improvement over multiple choice formats: 

I like this format much better than multiple choice tests. I feel it is a
better test of knowledge application. However, this type of test should 
not replace classroom observation. 

In comparison to the multiple choice section, you're able to explain 
reasons for answers. In comparison to classroom observations, you 
don't feel nervous and stress from being watched. 

[This is] better. The CBEST and NTE are so general as to be totally 
vague. They do not address the specific abilities and problems teachers 
face in actual classroom setting.

Only one candidate opposed the video format: 

I feel this assessment rewards people who watch television and 
penalizes people who read. 

One teacher stressed that if we change the way teachers are assessed, then we also 
need to change the way teachers are taught: 



If this is the way student teachers are tested then the curriculum needs
to be adjusted to include more methodology and more practice 
scenarios. 

FWL staff concurs that, if this method of assessment were adopted by the state, the 
teacher preparation programs would do well to incorporate into their methodology classes 
the use of videotaped scenarios of teachers teaching. 

Assessment Format 

LAPKA's format is that of a written, constructed response assessment with a 
videotape stimulus. The assessment consists of four videotaped scenarios of teachers 
teaching, each of which is shown in segments. After each segment the teachers are asked 
to respond in writing to a series of questions about the material just viewed. 

The format of the assessment is discussed by looking at the clarity of the following: 
(1) the materials sent to teachers in preparation for the assessment, (2) the assessment task 
materials (i.e., directions, literature, and questions), and (3) the scoring criteria and 
procedures. 

Clarity of Teacher Preparation Materials 

Prior to the assessment, the teachers received a two-page information sheet that 
included a brief explanation of the purpose of the assessment and logistical details (e.g., 
time, location), along with a two-page description of LAPKA supplied by the exercise 
developers. Most of the teachers indicated that they were satisfied with the materials that 
were sent to them. 

When asked if there were any additional materials that would have been helpful to 
have in preparation for the assessment, the teachers made several suggestions. The most 
commonly made suggestion was to include some information about the scoring criteria. 
Another suggestion was to warn teachers to "be prepared to write a lot." Finally, one 
teacher thought the materials could be improved by providing more information on "how the
results will be used."



Clarity of Task Materials 



The format for each of the four LAPKA scenarios included the following: 
"Instructions to the Viewers" given to the candidates at the beginning of each scenario to 
provide an "advance organizer" for the upcoming tasks, followed by a series of questions to 
be answered after viewing the scenario. In addition, for three of the four scenarios, teachers 
were provided the piece of literature which was the central focus of the videotaped lesson. 

Although the teachers did not report any difficulty in reading or interpreting the 
literature selections for the three scenarios, the teachers did have some suggestions for 
improving the directions given at the beginning of each scenario and some of the assessment 
questions. These suggestions are presented below along with some suggestions made by 
FWL staff with regard to the assessment materials in general. 

Suggestions for improving the task directions. Some of the teachers thought the 
directions preceding the scenarios could be improved by providing information that 
informed the viewer of the length and content of the upcoming video segment. Teachers 
reported being caught off guard several times by segments that were very brief, particularly 
some in Scenario 2, and as a result they felt unprepared to answer the questions. 
Commented one teacher, 

Part F [in Scenario 2] was far too short! It would have been nice to be 
forewarned that it was approximately one minute in length. 

Teachers also reported being misled by the content of some of the videos. In particular,
in Scenario 3 some of the candidates reported that they expected the opening segment to
include not only the teacher discussing her goals for the lesson but the actual lesson as well,
and were surprised when the video was stopped and they had to answer the questions based
only on the teacher's comments. These teachers' expectations were not unfounded, as the
prompt for Scenario 3 described the first video segment as containing "the teacher
describing her objective and conducting the introduction to the lesson."

Based on FWL staff's observation of the confusion experienced by some teachers
during the pilot test administrations and on our examination of the directions for the 
assessment tasks, we agree that the directions to the tasks could be improved by including 
the length of the segment to be viewed and an accurate description of the segment's 
content. 



Suggestions for improving the questions. Not all of the questions in the scenarios
were clear to the teachers and so there were several recommendations that the questions be 
reworded. One teacher explained, 



Not always clear as to what you wanted responded to, wording not
always clear.

In particular, some of the questions from Scenarios 2 and 3 were identified as 
problematic. In Scenario 2, for example, Question 1, Part C asks the candidates to "identify 
the elements of an integrated approach to language arts which are present in this segment." 
Most of the candidates interpreted the word "integrated" to refer to "integration across the
curriculum" rather than the developers' intended meaning of "integration of the language 
arts" (i.e., reading, writing, speaking, listening). A typical comment was: 

The integrated language arts video question was confusing because I 
have a different meaning for integrated language arts. 

Rephrasing the question to make explicit the concept of integrating the language arts 
would reduce the confusion. 

In Scenario 3, some candidates experienced confusion with Question 1, Part A, which 
asked them to "identify and discuss two pieces of background information which are
relevant to the lesson." The teachers were uncertain whether "background information"
referred to their own background information, the teacher's background information for the
lesson, or the students' background knowledge. A different word choice for "background
information," or a rewording of the question, would probably take care of the problem.

Other suggestions for improving the materials. In addition to supporting the above 
teachers' recommendations, FWL staff suggest that consideration be given to the following
recommendations. First, based on a review of the candidates' responses, the spaces for 
candidates' responses could be more clearly labelled. In some instances, questions were 
skipped over by candidates because the space for responding to them was not clearly 
marked (e.g., Scenario 1A, Question 3B on "One suggested improvement"). 

Second, we suggest that a review should be made of all the LAPKA questions so that 
only questions which are strictly dependent on viewing the videotape are included. In 
Scenario 2, for example, Question 2, Part B, asks the teachers, "Is 'publication' or 'sharing'
an important part of the writing process? Why or why not?" This type of question can be
answered without viewing the videotape and therefore does not take full advantage of the
assessment method of video stimulus. Similarly, in the same scenario, another question
asks the teachers to "comment on the strengths and/or weaknesses of this teacher's answer
to the question on grading." A question of this type also does not take full advantage of the
video stimulus because the candidate could also answer it if provided with
a script of the student's question and the teacher's response. Since the cost of videotapes,
videotape equipment, etc. is much higher than that of paper-and-pencil assessments, the questions
that are part of a video stimulus assessment should strive, as much as possible, to take full
advantage of the video medium. That is, consideration should only be given to including
questions that cannot easily be answered through some other assessment format.

Third, although most of the teachers indicated that overall they had enough time to 
complete the assessment, several teachers stated that for Scenario 3 they would have liked 
more than the allotted time. Of more importance, however, is the fact that many of the 
teachers felt the entire assessment was too long. FWL staff suggests that consideration be 
given to the following teacher recommendations for reducing the time of the assessment: 
reduce the length of the videotape lessons; show fewer individual writing conferences; ask 
fewer questions; and eliminate any redundancies in questions. 

Clarity of the Scoring Criteria and Procedures

Important concerns were identified by the scorers, our consultant on cultural 
diversity, and FWL staff regarding the LAPKA scoring system. These concerns, while 
overlapping to some degree, are presented as separate points. 

Although the criteria and procedures for scoring the candidate's responses are quite
straightforward (i.e., a candidate's responses are usually compared to a list of predetermined
acceptable responses in the scoring key), the general consensus is that the
scoring key is too narrow in scope and excludes many acceptable responses. In scoring the 
pilot tests, the scorers identified many responses that they believed exemplified effective 
language arts practices but were not credited because they were not listed on the scoring 
key. Some typical scorer remarks are as follows: 

The scoring criteria are much too narrow in scope to allow for the 
variety of possibly valid answers. 



Answers would need to be added that are just as appropriate as those now
listed. The answers listed . . . reflect an interpretation of language arts
pedagogy that is very limited.

Because the scorers were advised by the trainers during the pilot test scoring session 
that their role was to score the candidate responses against the scoring key and not to add
to or change it, the responses that fell outside the scoring key (but were perceived as valid
by the scorers) were not discussed among the scorers and trainers during the scoring
session. Thus, the scorers were given no opportunity to modify the scoring key. 

Given the scorers' deep reservations about the narrowness of the scoring criteria, 
further development of this assessment might include a review of the pilot test responses 
and an examination of the responses that were not credited but that the scorers found to be 
acceptable practices. Through such an analysis, along with additional rounds of pilot testing 
and analyses, a more comprehensive list of acceptable responses might be developed. 
However, a potential problem could arise if through this process the list of acceptable 
responses becomes unmanageably long. A possible solution is to shorten the lists by 
combining many of the specific practices listed in the scoring key into a few general 
guidelines. This approach to scoring, however, requires a greater dependence on the 
scorers' professional knowledge and would require more extensive training to ensure that
the scorers understand and are applying the scoring guidelines in a consistent fashion. 

The scorers also felt that too often the scoring key implied that there was a single 
"right answer," and ignored the influence of context on decision-making. One scorer 
expressed her concern as follows: 

If there were true "right answers" to the myriad problems that teachers
face everyday in their classroom, then school reform or even the reform
of language arts teaching in elementary schools would be a simple
matter of making sure everyone knew the "right answers." . . . There
are ways of handling situations that are more or less appropriate for a
given context, but no "right answers" that can remedy any situation,
even the ones shown on the videotape.

Another concern expressed by the scorers was that the assessment's analytic scoring 
focused their attention too much on details and not on the candidate's overall performance.
Suggested one scorer: 

I believe that some type of holistic scoring would be the answer. It
would allow the candidate to be evaluated on his/her entire
performance, rather than on tiny bits of it.

A more holistic scoring system would perhaps address the scorers' concern that the
present scoring system does not allow scorers sufficient latitude to apply their professional
knowledge to make an overall judgment about a candidate's knowledge and skills. The 
scorers felt that they were required to apply the scoring criteria in too rigid a fashion, even 
when they had evidence from the candidate's response that the candidate lacked 
understanding of a particular concept or had misapplied the concept in the response. 
Allowing the scorers more latitude would enable them to look across a candidate's responses
to find evidence of the candidate's understanding. Commented one scorer: 

It frightens me that some candidates give an answer that fits within 
the perimeters of acceptable answers and thereby get credit for it, but
indicate elsewhere or even in the rest of their answer that they adhere
to questionable language arts approaches. 

The scoring key was perceived by scorers as accepting "buzz words" as correct 
responses without examining the candidate's understanding of the concept underlying the 
word. One scorer remarked that the scoring system assessed teachers' "ability to use
edu-speak."

The scorers also agreed with the teachers that some of the questions and directions 
were confusing, thus resulting in answers different from what the developers intended, and 
presenting a problem for scoring. One scorer remarked: 

Another difficulty was scoring answers to questions that had obviously 
not been understood by the test-taker in the same way as they had 
been understood by the test-maker. [Also,] it was very difficult to score 
questions where the directions or layout of the questions apparently 
confused the candidate. 

Finally, other concerns about the fairness of the LAPKA scoring criteria and 
procedures with regard to different groups of teachers and teachers of diverse students were 
raised by our consultant on cultural diversity, Sharon Nelson-Barber. These concerns were 
already discussed in the Content section of this chapter, but to briefly restate her concern, 
the LAPKA scoring criteria and procedures seem to support one way to teach the language
arts to the exclusion of others. Nelson-Barber points out that this way is not necessarily
appropriate for all students or teachers and that alternative approaches could also be 
appropriate. With the present scoring criteria and procedures, however, scorers who are 
aware of the many ways of teaching diverse student groups are unable to apply this 
knowledge. 

Cost Analysis 

Administration and Scoring Cost Estimate 

The tasks of the Language Arts Pedagogical Knowledge Assessment are administered 
in a large group setting using video tape monitors to present teaching segments to which 
teacher candidates provide written responses. The tasks can be administered by one or 
more persons with little or no training in the specific content of the assessment using 
procedures common to standardized group test administrations. 

Scoring requires the availability and training of raters knowledgeable in the content 
and criteria of the assessment. Scoring of the pilot test data, which included both training 
and actual scoring, involved two days for the six scorers to be trained and to score 42 
teacher assessments. We estimate that, once trained, a rater could score approximately 20-30
assessments/day, provided the assessment consists of a set of tasks similar to the three
scenarios pilot tested. Using $160/day for a scorer's cost would result in an estimate of
$6.40/assessment (160/25) for scoring costs. We estimate that a two-day training session
would be needed to train raters for this assessment. If we assume that 20 raters could be
trained by one trainer in two days, then the costs for 42 days of rater and trainer time at
$160/day would equal $6,720. If it is assumed that each rater would participate in three
days of scoring following this training, then these training costs could be distributed across
approximately 750 teacher candidates. The figure of 750 teacher candidates assumes that
each of 20 scorers can rate 25 teacher candidates' responses each day and that each
teacher's responses will be rated twice (i.e., 25 assessments/rater multiplied by three days
multiplied by 20 raters equals 1,500). Dividing $6,720 by 1,500 results in an estimate of
$5/teacher for training costs. This estimate could be increased or decreased as a function of
the actual number of assessments that can be rated each day, the number of days a scorer
rates after training, and the number of scorers trained at one time. But these figures are
probably reasonable estimates of the costs for training and scoring assessments such as this.
Combining $13 (i.e., two ratings at $6.40, or $12.80, rounded off) for scoring and $5 for
training costs results in an estimate of $18/teacher assessment for scoring.

As mentioned initially, the only special feature needed for administration is the video 
tape equipment. Adding $2 per assessment for these costs to the $30/assessment we have 
used for other similar administrations results in an estimate of $32/assessment for 
administration. A summary of cost estimates for administering and scoring an assessment 
like this is: 

Training and Scoring: $18/teacher 

Administration: $32/teacher 

Total Scoring and Administration: $50/teacher
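
As a rough check, the arithmetic behind these estimates can be retraced in a short Python sketch. The function and parameter names are illustrative and are not part of the report; the dollar and volume figures are taken from the text above. Rounding each component up to a whole dollar is an assumption that happens to reproduce the report's $13 and $5 figures; the report says only that the figures were rounded off. The sketch also follows the report in dividing the training total by the 1,500 ratings.

```python
import math


def scoring_and_training_cost(
    daily_rate=160,         # dollars per scorer- or trainer-day (report figure)
    ratings_per_day=25,     # midpoint of the report's 20-30 range
    ratings_per_teacher=2,  # each teacher's responses are rated twice
    raters=20,
    trainers=1,
    training_days=2,
    scoring_days=3,
):
    """Return (scoring, training) in dollars, each rounded up (an assumption)."""
    # Scoring: cost of one rating times the number of ratings per teacher.
    scoring = daily_rate / ratings_per_day * ratings_per_teacher
    # Training: person-days of raters and trainer at the daily rate.
    training_total = (raters + trainers) * training_days * daily_rate
    # Ratings produced in the scoring days that follow training.
    ratings_total = ratings_per_day * scoring_days * raters
    # As in the report, the training total is spread over the ratings.
    training = training_total / ratings_total
    return math.ceil(scoring), math.ceil(training)


scoring, training = scoring_and_training_cost()
administration = 30 + 2  # base administration cost plus video equipment
total = scoring + training + administration
print(scoring, training, administration, total)  # prints: 13 5 32 50
```

Changing `ratings_per_day`, `scoring_days`, or `raters` shows how sensitive the estimate is to the assumptions named in the text; for instance, at 20 ratings per day the same sketch gives $16 for scoring and $6 for training.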

Development and Pilot Testing Costs 

The costs for developing the tasks for this assessment were $115,528 and are broken 
out by major cost categories in Table 6.6, which also includes costs for pilot testing. These 
development costs are the expenses for the assessment developer to deliver the prototype
activities to the CTC and SDE in the form they were used in pilot testing. $37,614 was 
spent for the pilot testing of these tasks with the 42 teachers. It is likely that future 
development and pilot testing that build on these and involve larger numbers of teachers 
would result in greater efficiency and lower costs. That is, it would not likely require the 
same level of development to obtain other video-taped scenarios and revised assessments; 
but since it is not possible to estimate the costs more precisely at this point, these data 
should provide a rough indication of the magnitude of effort for developing similar 
assessments. 

Technical Quality 

This section describes the technical issues related to the assessment.

Development

The Language Arts Pedagogical Knowledge Assessment was developed by Northwest 
Regional Educational Laboratory during the nine-month period from July, 1989 to April, 
1990. The development team consisted of several staff members of Northwest Laboratory 
along with three outside consultants who are experts in the fields of assessment and/or 
language arts instruction. 



6.37 

2B8 



I 



TABLE 6.6

DEVELOPMENTAL AND PILOT TEST COSTS FOR THE
LANGUAGE ARTS PEDAGOGICAL KNOWLEDGE ASSESSMENT (LAPKA)

                                          Development     Pilot Testing

Staff Salaries & Benefits                   $36,495          $17,368

Consultants (Teachers, assessors,
and other consultants)                       14,400            6,352

Travel (Consultants and staff)               15,117            4,944

Other Direct Costs (Site rental,
phone, duplication)                          28,649            1,070

Total Direct Costs                          $97,662          $29,734

Indirect Costs                               17,866            7,880

Total Costs                                $115,528          $37,614






In their initial meetings, the development team identified four scenario topics for 
videotaping: (1) A teacher-led, direct instructional activity, (2) an individual writing 
conference, (3) a small group, literature-based activity, and (4) a whole class discussion of a 
reading project. In addition, the development team decided to vary the grade level and 
student population across these scenarios, and that the teaching portrayed should model 
good practice. 

In a subsequent session, the development team met with a group of practicing 
teachers who were identified as potential demonstration teachers to be videotaped for the
scenarios. These teachers reviewed the scenario design plans and offered recommendations
for developing the actual videotapes, but only one of the teachers agreed to be videotaped. 
Through additional recruitment efforts, the developers located teachers in San Jose and 
Oregon who were willing to be videotaped as demonstration teachers. 

After the videotaping was completed in January, 1990, the development team 
reviewed the videotapes and edited them for assessment purposes. Contrary to the
developers' initial plans, however, the videotapes that were actually made did not, in some
instances, reflect good teaching practices. In the final phase of the project, the development
team prepared assessment questions and scoring criteria for each scenario. 

Reliability 

The following analyses were performed on the pilot test data of 42 teachers.
Interrater agreements were examined to assess the degree to which scorers were able to
consistently judge candidates using the LAPKA scoring protocols for Scenarios 1A, 1B, and 2,
for which two raters rated each teacher candidate. Internal consistency estimates were
generated to assess the degree to which scenario ratings would form a measure and the
degree to which the different activities related to each other and might form an overall
assessment.

Interrater agreement. Figure 6.1 contains a summary of the agreement between
raters on Scenarios 1A, 1B, and 2, for which there were two ratings. The agreements were
modest on Scenario 1A, where 57% of the candidates were assigned equivalent scores within
the 13-point range for that scenario. Scenario 1B has a range of only four points.
Thirty-five percent (35%) of the candidates were assigned the same ratings and 47 percent
of the ratings differed by one point on Scenario 1B. Scenario 2 has a score range from 0-35
points. Only 23 percent of the candidates were assigned scores within three points. Forty-two
percent differed by seven or more points. These results support the need for additional
revisions and development in the scoring system. The scorers and teacher candidates
provided helpful feedback on concerns with the current scoring system, and FWL has taken
these and our evaluations into account in making suggestions for improvements and next
steps. Many of these have been described earlier and will be highlighted in the summary
and conclusions section which follows.

[FIGURE 6.1: % Agreement of Raters for the Language Arts Pedagogical Knowledge
Assessment. Vertical axis: Percent of Teachers. Panels: Scenario 1A (13 points possible),
Scenario 1B (4 points possible), and Scenario 2 (35 points possible; score-difference
bands of 0-3, 4-6, and 7-17 points).]
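The agreement percentages discussed above can be computed from paired ratings by tallying how far apart the two raters' scores are for each candidate. The sketch below assumes that approach; the function name, the band boundaries, and the sample scores are hypothetical and are not the pilot test data.

```python
from typing import Dict, List, Tuple

Band = Tuple[int, int]


def agreement_by_band(
    rater_a: List[int], rater_b: List[int], bands: List[Band]
) -> Dict[Band, float]:
    """Share of candidates whose absolute score difference falls in each band.

    Each band is an inclusive (low, high) range of absolute differences.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty candidate list")
    diffs = [abs(a - b) for a, b in zip(rater_a, rater_b)]
    shares = {}
    for low, high in bands:
        count = sum(1 for d in diffs if low <= d <= high)
        shares[(low, high)] = count / len(diffs)
    return shares


# Hypothetical scores on a 0-35 scale for five candidates.
scores_a = [10, 20, 15, 30, 8]
scores_b = [12, 14, 15, 21, 13]
shares = agreement_by_band(scores_a, scores_b, [(0, 3), (4, 6), (7, 35)])
for band, share in shares.items():
    print(f"difference {band[0]}-{band[1]} points: {share:.0%}")
```

Reporting agreement in bands rather than as exact matches is a reasonable choice when the score range is wide, as with the 35-point Scenario 2.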

Interrater correlations. The correlations between raters are summarized below: 

                       Rater Pair              Averaged
Scenario          1        2        3        Pair Ratings

   1A            .22      .91      .16           .56

   1B           -.25       --      .00           NA

   2             .46      .42      .46           .45

Interrater agreements were low, and the ratings on Scenario 1B precluded computing
correlations for two rater pairs; e.g., in Rater Pair 2 one rater assigned the same rating to
all five of the teachers s/he rated. Given the small N's and the need for further
development in the scoring system, these estimates should be interpreted as only
preliminary and lower bound estimates of the potential for obtaining consistent ratings with
this assessment prototype.
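A correlation cannot be computed when one rater gives every candidate the same rating, because that rater's scores have zero variance and the Pearson formula would divide by zero. The sketch below illustrates this with a hand-written Pearson correlation; the function and the sample ratings are hypothetical, not the pilot data.

```python
from math import sqrt
from typing import List, Optional


def pearson_r(x: List[float], y: List[float]) -> Optional[float]:
    """Pearson correlation, or None when either rater's scores do not vary."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("Need two equal-length lists with at least two ratings")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        return None  # Zero variance: the correlation is undefined.
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return ss_xy / sqrt(ss_x * ss_y)


# Hypothetical ratings: the first rater gives all five teachers the same score.
constant_rater = [2, 2, 2, 2, 2]
other_rater = [1, 2, 3, 2, 4]
print(pearson_r(constant_rater, other_rater))  # undefined, so None is returned
```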

Internal consistency of the scenarios. Coefficient Alpha reliability estimates were
calculated for the scenarios and are presented below:

Scenario          Reliability

   1A                 .45

   1B               -1.09

   2                  .25

   3                  .53

These estimates are also quite low and imply that the ratings conducted in this pilot
test achieved little internal consistency. This is probably attributable to several factors,
one of which is the small sample size associated with the pilot test. But equally important
are the issues related to the factors described earlier. For example, in Scenario 1B the
"negative" reliability derives from the fact that two items negatively correlated with the
third on a three-item task. The interrater and internal consistency data provide empirical
evidence of the need for further development, refinement, and research before this
assessment would be ready for large scale administration or field testing.
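Coefficient alpha can fall below zero when items are negatively related, which is the situation described for Scenario 1B. The following is a minimal sketch of the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores); the item scores are hypothetical and chosen only to show the effect.

```python
from statistics import variance
from typing import List


def coefficient_alpha(items: List[List[float]]) -> float:
    """Cronbach's coefficient alpha for a list of items.

    Each inner list holds one item's scores, in the same candidate order.
    """
    k = len(items)
    if k < 2:
        raise ValueError("Alpha requires at least two items")
    n = len(items[0])
    if n < 2 or any(len(item) != n for item in items):
        raise ValueError("Every item needs the same number (>= 2) of scores")
    totals = [sum(scores) for scores in zip(*items)]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("Total scores do not vary; alpha is undefined")
    item_variance_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical three-item task: the third item runs opposite to the other two.
item_1 = [1, 2, 3, 4]
item_2 = [1, 2, 3, 4]
item_3 = [4, 3, 2, 1]
print(coefficient_alpha([item_1, item_2, item_3]))  # negative value

# The same task with three perfectly consistent items.
print(coefficient_alpha([item_1, item_1, item_1]))
```

When items pull in opposite directions, the variance of the total score is smaller than the sum of the item variances, which is what drives the estimate below zero.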

Intercorrelations among tasks. Given the low reliabilities of the current measures
reported above, the correlations among the scenarios would not be expected to be very
high. For informational purposes we have listed correlations among tasks below:

          1A      2      3                1B      2      3

   1A      -                       1B      -

   2     -.20     -                2      .23     -

   3     -.12    .41     -         3     -.12    .41     -

The .41 correlation between Scenarios 2 and 3 is the only statistically significant
relationship, and this is partially accounted for by the fact that it is based on 39
teachers, whereas correlations involving Scenarios 1A and 1B were based on only half the
teachers.

Validity of Agreement Through Group Comparisons 

Appendix E contains means, standard deviations, and related statistics for teacher
candidates by several variables (i.e., gender, number of courses, grade level taught, teaching
location: urban-innercity-suburban, and minority-nonminority). There were no notable
trends in these data. This again reflects a combination of the fairly low reliability of the
measures and small numbers in some groups. For example, the similarity in minority and
nonminority teacher performance in this pilot test is simply an indication of the
assessment's weak ability to reliably assess teacher candidates.

Content Validity 

Evidence of the content validity of the assessment comes from three sources. The
first is the role that the developers and the experts in the fields of assessment and language
arts instruction had in designing the assessment. The second is the analyses of the match
of the assessment to the model curriculum guide and California Beginning Teacher
Standards that compare the assessment's content with that recommended in the official
documents. The third is in the type of concerns raised by the beginning teachers who
participated in the pilot test. These have been described earlier and implications for further
development are addressed in the following section.

In summary, the current scoring criteria and training with this pilot test sample did 
not produce sufficient data on which to judge accurately the technical quality potential of 
this prototype assessment. Perhaps, with revisions in, and further development of, the 
scoring criteria, future pilot tests could yield data that will support the technical quality and 
potential of the assessment. 

Conclusions and Recommendations 

Language arts instruction, the focus of this assessment, is central to elementary 
school teaching. As an instrument for measuring teachers' pedagogical content knowledge 
in this area, LAPKA has some strengths as well as some serious weaknesses, particularly in 
the area of scoring. In light of these strengths and weaknesses, FWL offers a number of 
recommendations for modifying this instrument. 

Administration of Assessment 

LAPKA is relatively easy to administer and can be efficiently given to large groups of 
teachers. Other than the need for videotape players and monitors, the requirements for 
administration are similar to other large-scale, paper-and-pencil tests. LAPKA requires no 
specialized knowledge from test administrators. 

LAPKA consists of four different scenarios, each of which requires approximately an
hour and a quarter to complete. For the pilot test, candidates took three of four scenarios
during a four-and-a-half hour administration session. Based upon the candidates' feedback,
we recommend the following:

• Limit the assessment to no more than three hours. Many candidates
indicated that fatigue began to interfere with their performance.






Orientation materials for this pilot test consisted of a two-page description of the 
assessment. Based on teachers' feedback about these materials, we recommend the 
following: 

• Provide candidates with more detailed orientation materials, including 
a description of a sample scenario, sample questions, and examples of 
the scoring criteria. In addition, it might be appropriate (and more 
efficient) to provide candidates with the literature selections ahead of 
time if advance knowledge of the text would not interfere with the 
goals of the assessment. 

Based on projected cost-per-candidate, LAPKA is relatively inexpensive to 
administer. However, each time that LAPKA is administered, a new set of videotapes and 
scoring criteria will need to be developed, which will add to the overall cost-per-candidate. 
Therefore, we recommend the following: 

• Limit the number of administration dates to minimize ongoing 
development costs. 

Ensuring that all candidates are given adequate opportunities to demonstrate what
they know and can do is one critical feature of an assessment; employing scorers who
understand and can recognize diverse ways of "good teaching" is another critical aspect of
creating a fair and effective assessment. Thus, we recommend the following:

• Recruit scorers who are recognized for their teaching excellence in 
language arts and who are knowledgeable of the California frameworks 
and curriculum guides in English-language arts. 

• Recruit scorers who have knowledge of and experience in multiple 
cultural settings. 

Based on the scorers' feedback, we recommend the following changes to the scorer 
training program: 

• Lengthen the time for training from the present six hours to a two day
session;

• Increase the amount of practice materials for scorers to use during 
training, with examples of poor and acceptable responses, and a range 
of responses that reflect diverse effective practices; 

• Allow teachers to view the videotapes from beginning to end before 
they actually begin to score the practice materials. 

Assessment Content 

The conclusions presented below are based on feedback collected from the pilot test 
teachers, scorers, and a consultant to the project on issues of diversity and equity, along 
with the observations of FWL staff. 

• Overall, LAPKA's congruence with the California English-Language 
Arts Model Curriculum Guide for Kindergarten through Grade Eight is 
good. While LAPKA does not address all of the guidelines, it portrays a 
variety of reading, writing, and oral language activities, all of which are 
consistent with the curriculum guide. 

• Coverage by LAPKA of the California Standards for Beginning
Teachers is limited. A small number of standards are addressed, and
these standards are addressed only indirectly because LAPKA does not
provide direct evidence of a teacher's ability to perform in the
classroom. Adding questions that specifically address the standards
could improve the information that the assessment provides about the
standards.

• LAPKA appeared to be moderately difficult for beginning teachers. The 
candidates correctly answered from 57% to 87% of the questions in 
each scenario. The candidates had the most difficulty with Scenario 2, 
which also had the greatest number of confusing questions.
Eliminating or rewording these questions would possibly improve 
candidates' scores. The range in candidate scores suggests that this 
exercise might be effective in discriminating weaker from stronger 
candidates in the areas assessed. 






• Teachers and scorers thought that the assessment was fair to teachers
of different grade levels. An analysis of the performance data indicates
no significant differences between teachers who taught in grades K-3
and 4-8.

• The LAPKA scenarios portray students of diverse cultures and various 
ability levels. Although most of the teachers perceived this assessment 
to be appropriate for teachers of diverse student groups, some 
expressed concern about the lack of portrayal in the scenarios of 
bilingual classes, special education students, and limited English 
speakers. In addition, as pointed out by our consultant on cultural
diversity, LAPKA tends to reward only certain ways of teaching and, as 
a result, has the potential of discriminating against certain teaching 
techniques deemed effective in some minority communities. Thus, it is 
the conclusion of FWL staff that, as presently constructed, LAPKA may 
not be effective for assessing teachers across contexts and with diverse 
students. 

• While most teachers thought LAPKA was fair to different groups of
teachers, this view was not shared by some of the teachers, the scorers,
and the consultant on cultural diversity. One concern was the lack of 
teacher diversity shown on the videotapes (e.g., few minorities, no 
males). Another concern was that the LAPKA scoring process fails to 
recognize a range of culturally diverse ways of teaching. 

• The performance data indicated no significant differences in candidate 
performance based on gender, ethnicity, grade level taught, California 
or non-California teacher training program, or the number of reading 
methods courses taken. 

• The majority of teachers and scorers felt that LAPKA is not an
appropriate way to assess teachers. While they offered a variety of
reasons for this view, the most frequent objection cited was that
LAPKA does not assess teachers' ability to teach; instead, it assesses
their ability to evaluate other teachers.

Based upon the teachers' and scorers' comments, as well as our review of the content 
of the assessment, we recommend the following changes to the individual scenarios: 




• Reconceptualize the writing activity in Scenario 2. The teachers and 
scorers felt that the individual writing conferences in this scenario were 
quite unrealistic. In a class of twenty-five to thirty students, teachers 
rarely get the opportunity to sit down with individual students and 
hold extended conferences about a particular piece of writing. 
However, teachers do respond in person to students individually, but
usually in small group settings or at brief, desk-side interviews. In
small, peer-editing groups, for example, students (and the teacher, if 
present) respond to the possibilities and problems in a student's draft. 
In fact, some teachers argue that this small-group context is not only a 
more efficient way to respond to individual students, but a more 
effective one as well. 

• Begin Scenarios 1A and 1B with the videotape teacher presenting her
goals for instruction, as was done with Scenario 3. Scenarios 1A and
1B differ from Scenarios 3 and 4 in several significant ways. Scenarios
3 and 4 begin with the teacher explaining her goals for instruction to
the viewer, while Scenarios 1A and 1B begin with the actual lesson.
Given that many teaching practices can be appropriate or inappropriate
depending upon the teacher's goals, contextual factors, etc., it seems
particularly important for the viewer to be aware of some of these
features, in particular the teacher's goals for the lesson or unit. It is
much more difficult to evaluate a teacher's instruction without knowing
that teacher's goals for the lesson or unit. In addition, for all of the
scenarios, it might also be productive to provide the viewer with some
information about the students, the course, etc.

Assessment Format 

The LAPKA format has a number of strengths: (a) The video stimulus can present 
the contexts and complexities of teaching in ways that paper-and-pencil formats cannot; (b) 
The open-ended question format, in which candidates construct brief written responses, 
gives candidates the freedom to focus on particular aspects of the videotaped teacher's 
performance as well as the responsibility for appropriately framing their responses, and (c) 
Requiring candidates to give rationales for many of their responses allows a candidate's 
reasoning to be taken into account in scoring his or her responses. 






Despite these strengths, however, the format of LAPKA could be improved in 
numerous ways. First, based upon the perceptions of the teachers, scorers, and FWL staff, 
we believe that consideration should be given to the following recommendations: 

• Prior to showing a video segment, provide information to the viewers 
about the length and content of each segment. 



• Improve the clarity of the scenario questions which caused confusion
for the pilot test participants.

• More clearly label the spaces provided for candidates' responses.

• Review all LAPKA questions so that only those questions which are
strictly dependent on viewing the videotape are included.

• Consider including only those questions which take full advantage of
the videotape stimulus and cannot easily be answered through some
other assessment format.

• Reduce the length of the assessment by reducing the length of the 
videotape lessons, showing fewer writing conferences, asking fewer 
questions, and eliminating any redundant questions. 

Finally, to improve LAPKA's format, consideration needs to be given to substantially
changing LAPKA's scoring process. In response to the reservations expressed by the
scorers, the consultant on cultural diversity, and FWL staff about LAPKA's scoring criteria
and procedures, we recommend the following:



• Broaden the range of responses considered "acceptable" in the scoring
key, especially taking into account culturally diverse conceptions of
teaching. This might be achieved by reviewing the pilot test responses
and adding additional rounds of pilot tests, particularly with minority
candidates, to build a more comprehensive picture of the range of
acceptable responses;

• Take a candidate's teaching framework into account as explicitly as
possible in scoring a candidate's responses. This might be achieved by
having candidates write a brief statement of philosophy or goal
statement for each scenario (e.g., views on the teaching of writing), and
by having candidates provide rationale statements with their responses
(done to some extent in the present version of LAPKA).

• Move from the present analytical scoring system to a more holistic one 
that would allow scorers more flexibility to look across a candidate's 
entire performance and that would enable scorers to consider a 
candidate's responses in light of the candidate's rationales and 
philosophy. This shift would require a number of changes in the
scoring system (e.g., having scorers document their reasons for
assigning a particular rating).

In addition to the major changes described above, we believe that the scoring format
would be further improved by following these recommendations:

• Provide an explicit rationale for why some responses are acceptable 
while others are not; 

• Provide a rationale for why some responses are awarded two points, 
other responses one point; 

• Develop a plan for combining scores across scenarios to determine a 
total LAPKA score; 

• Provide a tentative estimate describing the relationship between 
candidate scores and levels of proficiency in teaching elementary 
language arts (e.g., weak, adequate, exemplary). 

Summary 

LAPKA, an assessment of pedagogical content knowledge in the elementary language 
arts, is easy and relatively inexpensive to administer, captures some of the complexities of 
teaching through its video format, and assesses candidates' knowledge on a wide range of 
language arts practices. The present scoring system, however, requires substantial 
revisions. In particular, the scoring system needs to be more sensitive to the
context-dependent nature of instructional decision-making and to conceptions of teaching different
from those portrayed in the videotapes. 






CHAPTER 7:
SECONDARY ENGLISH ASSESSMENT



One of the more innovative assessment prototypes pilot tested, the Secondary
English Assessment was developed at San Francisco State University, San Francisco. The
prototype can truly be described as an alternative performance-based assessment as it
includes four very different activities, each of which requires the teacher to demonstrate (or
"perform") a different skill or ability.

Although all of the activities are designed to assess a teacher's approach to language 
and literature learning in a multicultural, multilingual context, the first three activities are 
conducted during a half day at an assessment center, and the fourth takes place over a 
three-month period in the teacher's classroom. The assessment center activities ask the
teacher to demonstrate performance abilities in reading, writing, speaking, listening, 
responding to literature, evaluating student writing, and explaining language concepts. 

• Activity A, Responding to Student Writing - (approximately 50 minutes) 
This activity focuses on the teacher's skill in responding to student 
writing in a particular context. The teacher is given two samples of 
student writing and, for each sample, is asked to 1) write directly on the 
student sample addressing the student writer, and 2) analyze the 
student's text on a separate evaluation form, writing for a peer 
audience. Upon completion of the activity, which can be administered 
by a proctor, the teacher's responses are evaluated by at least one 
assessor. 

• Activity B, "Fishbowl" Discussion of Literary Work - (approximately 50
minutes) In this group exercise, the teacher is asked to demonstrate
his/her skills of literary interpretation and collaborative learning. To
prepare for the activity, the teacher reads a designated short story,
responds to the story in an informal log, and prepares questions for
discussion. Then, at the assessment center, the teacher participates in
an oral discussion of the story with three other new teachers who have
read and prepared the same story. The activity is simultaneously
administered and scored by two assessors.

• Activity C, Speaking of Language - (approximately 50 minutes)
This activity focuses on a teacher's skill in impromptu oral performance.
Sitting on a panel with three other new teachers, the teacher is asked to 
give an impromptu oral presentation (approximately five minutes long) 
on a topic pertaining to language and literature in a multicultural 
society. The teacher makes the presentation in response to a question 
based on a set of readings provided for the assessment. After the 
presentation, the teacher answers one follow-up question posed by a 
fellow panelist. The activity is simultaneously administered and scored 
by two assessors. 

The fourth activity of the Secondary English Assessment is the Classroom Portfolio which is 
prepared by the teacher in his/her classroom. A description of this activity is as follows: 

• Activity D, Classroom Portfolio - (to be completed during a three-month 
period) This activity evaluates a teacher's skills in three areas: 
planning and implementing a teaching unit, responding to student work, 
and reflecting upon his/her experience in teaching the unit to gain 
insight for further teaching. The teacher plans and conducts a three- to 
six-week teaching unit in which the classroom activities are unified by a 
single focus. To document the teaching activities, the teacher compiles a 
classroom portfolio which consists of various interrelated components 
(e.g., weekly log, materials and assignments given to students, samples 
of student work with teacher responses). The completed portfolio is 
submitted to at least one assessor for evaluation. 

For each of the four activities, the teacher's performance is evaluated at three levels:
(1) according to specific criteria listed under a particular skill or category, (2) at the skill or
category level, and (3) at an overall level. Figure 7.1, for example, is the response form used
by assessors to evaluate a teacher's performance on Activity A, Responding to Student
Writing. As depicted on the form, a teacher's performance is evaluated according to specific
criteria corresponding to two skills or categories: (1) Response Strategies, and (2) Analysis
of Writer and Text. The teacher is given a rating for each of the criteria in both categories,
a rating for each skill or category, and an overall rating. As is also indicated on the form, at
each level the teacher's performance is rated along a four-point scale, with a rating of "4"
indicating a very strong performance and a rating of "1" being a very weak performance. All
of the ratings are made in a holistic manner and are not interdependent. That is, the
category ratings are not a composite or sum of the criteria ratings, and the overall rating is
not a composite or sum of the category ratings.

FIGURE 7.1

Response Form A

RESPONDING TO STUDENT WRITING

Candidate's Name:                    Date   /   /                    Evaluator:

                                                          Very            Very
                                                          Strong          Weak

Part I. In comments addressed to student writer, the Candidate:

A. Conveys reader's interest by responding to
   writer's meaning (content, ideas, information),
   not merely evaluating technique                          4    3    2    1

B. Demonstrates understanding of writer's
   goals and purposes                                       4    3    2    1

C. Responds in a way that would promote
   confidence in future writing attempts                    4    3    2    1

D. Provides accurate and useful feedback on
   technical aspects of the writing                         4    3    2    1    NA

E. Makes helpful suggestions for revision
   or future writing                                        4    3    2    1    NA

OVERALL RESPONSE STRATEGIES                                 4    3    2    1

Part II. In analyzing the student text for colleagues, the Candidate:

A. Adequately describes writer's purpose/goals
   and method (identifies intended genre,
   task definition)                                         4    3    2    1

B. Identifies effective features of the text:
   a. content                                               4    3    2    1    NA
   b. structure (focus, organization)                       4    3    2    1    NA
   c. development                                           4    3    2    1    NA
   d. style/voice (syntax, diction)                         4    3    2    1    NA
   e. mechanics                                             4    3    2    1    NA

C. Identifies problematic features of the text:
   a. content                                               4    3    2    1    NA
   b. structure (focus, organization)                       4    3    2    1    NA
   c. development                                           4    3    2    1    NA
   d. style/voice (syntax, diction)                         4    3    2    1    NA
   e. mechanics                                             4    3    2    1    NA

D. Draws sound inferences about general
   strengths of writer                                      4    3    2    1

E. Makes sound suggestions about what writer
   would benefit from learning                              4    3    2    1

OVERALL ANALYSIS OF WRITER AND TEXT                         4    3    2    1

OVERALL RATING                                              4    3    2    1

Evaluator Comments:
(Additional comments on reverse)

Key to Rating Scale:

4 = Definite strengths in this area; 3 = Some strength in this area; 2 = Lacks strength in this area;
1 = Serious weaknesses in this area; NA = not applicable in this instance.
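The three-level rating scheme can be represented as a small data structure in which the category and overall ratings are entered independently rather than derived from the criteria. The sketch below is one hypothetical way to model Response Form A; the class and function names are illustrative, and for simplicity it permits "NA" (modeled as None) on any criterion, whereas the form offers NA only on some.

```python
from dataclasses import dataclass
from typing import Dict, Optional

VALID_RATINGS = (1, 2, 3, 4)  # 4 = very strong ... 1 = very weak


def check_rating(value: Optional[int], allow_na: bool = False) -> Optional[int]:
    """Validate one rating; None stands for 'NA' where that is permitted."""
    if value is None:
        if allow_na:
            return None
        raise ValueError("NA is not permitted for this rating")
    if value not in VALID_RATINGS:
        raise ValueError(f"Rating must be one of {VALID_RATINGS}, got {value!r}")
    return value


@dataclass
class CategoryRating:
    """Criterion ratings plus a holistic category rating, entered separately."""
    name: str
    criteria: Dict[str, Optional[int]]
    overall: int  # Assigned by the assessor, not computed from the criteria.

    def __post_init__(self) -> None:
        for rating in self.criteria.values():
            check_rating(rating, allow_na=True)
        check_rating(self.overall)


# Hypothetical ratings for Part I of the form.
part_one = CategoryRating(
    name="Response Strategies",
    criteria={"A": 4, "B": 3, "C": 3, "D": None, "E": 2},
    overall=2,
)
print(part_one.name, part_one.overall)
```

Keeping the overall rating as its own field mirrors the holistic design: nothing in the structure forces it to equal a sum or average of the criterion ratings.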

Following the four evaluations, a summative evaluation called a Competency Profile
is completed. The Competency Profile synthesizes the evaluations of all four activities and
includes a recommendation regarding credentialing.

The first three activities (i.e., Activities A, B, and C) were pilot tested during the 
period covered by this report (spring and summer, 1990). The Classroom Portfolio activity, 
which allows approximately three months for a teacher to plan and teach a unit, and 
compile the portfolio, is scheduled for administration in the fall of 1990. 

The administration, the content, and the format of the first three activities of the 
Secondary English Assessment are discussed below. The content and format sections of the 
report contain information from the teacher and assessor evaluation forms, as well as 
information and analysis of scoring results. Following these three sections are sections on 
cost analysis and technical quality. The chapter concludes with an overall summary 
together with recommendations for further steps in exploring the feasibility and utility of 
assessment activities such as these in California teacher assessment. 

Administration of Assessment Center Activities 

Beginning with an overview of the administration of the three assessment center 
activities of the Secondary English Assessment, this section provides information on the
following: logistics (e.g., identifying the teacher sample, scheduling the activities), security, 
assessors and their training, scoring, and perceptions of the assessment activities by 
teachers, assessors, and FWL staff. 

Overview 

The three assessment center activities of the Secondary English Assessment were
administered on August 11, 13, and 14, 1990 from approximately 9:00 a.m. to 1:00 p.m. each
day. Four trained assessors and a FWL staff person administered the activities in two
conference rooms at a hotel in San Francisco. Although 20 teachers were scheduled to
participate, one teacher had to cancel at the last minute due to personal problems.

As shown in Table 7.1, of the 19 participating English teachers, the majority were
Caucasian (non-Hispanic) females teaching at the high school level. An equal number (8) of
teachers came from schools in northern and southern California, and three teachers came
from schools in the central valley. Approximately two-thirds of the teachers were
participating in the CNTP-sponsored teacher support projects. A little over one third of the
teachers were teaching in inner city schools.

Logistics 

Administration of the three assessment center activities entailed numerous logistical
activities. First, there were activities common to all the assessments, such as identifying a
teacher sample, recruiting and training assessors (who also served as scorers), scheduling and
making arrangements for the teachers to be assessed, sending orientation materials to the
teachers, and acquiring evaluation feedback from the teachers and the assessors. In
addition to these activities, there were two other important logistical activities relevant to
this assessment: recruiting trainers and developing the training for assessors, and extensive
revision of the developer's original orientation materials.

Identifying the teacher sample. As mentioned earlier, Table 7.1 presents information
about the teacher sample for this assessment. As was the case with other assessments, it
was necessary to recruit Non-Project teachers in addition to Project teachers in order to
have a sample that represented different regions of the state. It was also hoped that by
recruiting Non-Project teachers, the teacher sample might have a better representation of
different ethnic groups; however, only one of the Non-Project teachers was identified as
non-Caucasian. The Non-Project teachers, almost all of whom identified themselves as
suburban and urban teachers, were recruited by calling school districts and asking for
names of first-year and second-year English teachers. All teachers, Project and Non-Project,
were offered $300 to participate in the three assessment center activities and to complete a
portfolio.






TABLE 7.1 

PILOT TEST PARTICIPANTS 
SECONDARY ENGLISH ASSESSMENT 
(Number of Teachers = 19) 

Teacher Location              Project    Non-Project 

Northern California              4            4 
Southern California              6            2 
Fresno                           3            0 

Total Number of Teachers        13            6 

Characteristics (all 19 teachers): 

   16 Caucasian, non-Hispanic; 2 Hispanic; 1 Asian or Pacific Islander 
   6 Male; 13 Female 
   12 High School; 7 Junior High 
   6 Suburban; 5 Urban (not inner city); 7 Inner City; 1 N/A 








Recruiting and training of assessors. Four experienced high school English teachers 
were recruited and trained to serve as assessors and scorers for the Secondary English 
Assessment. Two of the teachers had previously administered and scored the activities 
during an August, 1989 workshop conducted by the developers. The other two teachers had 
no previous experience with the assessment. 

The four teachers attended a two-day training session given at FWL. The training 
was given by two trainers, both of whom were recruited by FWL staff from the pool of 
experienced English teachers who had previously administered the assessment in 1989. The 
trainers worked with a member of FWL staff in designing the training. (For more 
information, see the section, "Assessors and Their Training.") 

Scheduling/Arranging the assessments. The original 20 teachers were scheduled over 
three days. Because two of the assessment activities are administered to groups of four, the 
number of teachers scheduled for each day had to be a multiple of four. The final schedule 
was as follows: 8 teachers on Saturday, August 11; 8 teachers on Monday, August 13; and 4 
teachers on Tuesday, August 14. Whenever possible, the teachers were given a choice of the 
day for assessment. 
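
The group-size constraint described above can be checked with a short sketch. The day labels and head counts come from the schedule reported in the text; the function name `check_schedule` and the dictionary layout are illustrative assumptions, not part of the pilot test procedures.

```python
# Two of the assessment activities are administered to groups of four.
GROUP_SIZE = 4

# Final schedule reported in the text (teachers per assessment day).
schedule = {
    "Saturday, August 11": 8,
    "Monday, August 13": 8,
    "Tuesday, August 14": 4,
}


def check_schedule(schedule, group_size, expected_total):
    """Return the days whose head count is not a multiple of group_size.

    Raises ValueError if the schedule does not cover expected_total teachers.
    (Hypothetical helper written for illustration only.)
    """
    total = sum(schedule.values())
    if total != expected_total:
        raise ValueError(
            f"schedule covers {total} teachers, expected {expected_total}"
        )
    return [day for day, count in schedule.items() if count % group_size != 0]


# The original 20 teachers were scheduled over three days.
bad_days = check_schedule(schedule, GROUP_SIZE, expected_total=20)

# Number of four-person groups that can be formed on each day.
groups_per_day = {day: count // GROUP_SIZE for day, count in schedule.items()}
```

An empty `bad_days` list means every day can be split into complete groups of four, which is the condition the scheduling had to satisfy.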

In addition to scheduling the teachers, air and/or hotel arrangements were made for 
some of the teachers so that they could attend the San Francisco administration site. 
Arrangements were also made to reimburse teachers for assessment-related expenses (e.g., 
parking, air shuttle). 

Developing and sending the orientation materials. The developer provided the state 
with a package of orientation materials which were used in the developer's 1989 pilot test 
(i.e., the August workshop). These materials, however, were deemed by CTC and SDE staff, 
as well as by FWL staff, as in need of extensive revision for the 1990 pilot test. A member 
of FWL staff, with the assistance of CTC and SDE staff members, revised all of the 
orientation materials (as well as the assessment booklets used on assessment day) for 
Activities A through D. 

The revised orientation materials comprised an orientation handbook sent to all of 
the teachers before the administration of the assessment. The materials described each 
assessment activity (including the portfolio activity), the criteria by which the teachers 
would be evaluated, and any preparatory activities they needed to complete before 
assessment day (e.g., keeping a log, completing a practice assignment). The handbook also 
included all the reading materials needed for each assessment activity (e.g., a short story, a 
set of articles). 

Collecting evaluation feedback. FWL staff designed two evaluation feedback forms 
on which the teachers and the assessors could give their thoughts and opinions about the 
assessment. The teachers filled out the evaluation form immediately after they completed 
the three assessment activities. The assessors completed their evaluation forms on the last 
day of the assessment. 

Security 

For the developers of the Secondary English Assessment, test security for the 
assessment center activities during the pilot test primarily meant one thing: The group of 
teachers on whom the prototype was pilot tested had equal naivete about the test form. 
This security was achieved by the fact that not one of the teachers who participated in the 
pilot test indicated that they were familiar with the assessment in any way. 

In the future, if some version of the prototype were adopted by the state for licensure 
purposes, the developers suggest that "it would be essential to involve many people in an 
introductory training session to ensure equal understanding of and preparation for this 
alternative form of assessment." Furthermore, the developers advocate "a full scale 
dissemination effort to guarantee all teacher training programs equal access to models of 
the prototype, for integration into teacher preparation as befits the individual campuses." 
(Both quotes taken from a March 6, 1990 letter written to the CNTP co-directors from one 
of the developers of the assessment.) 

Even if these measures are followed, however, the developers recognize that not all 
teacher candidates will receive the same quality training to prepare them for the 
assessment. It is for this reason that the developers designed the assessment so that the 
inequality of preparation could be offset by the provision of extensive materials and 
instructions for individual self-preparation. 

Other security measures to be considered if the prototype is adopted would be the 
collection and storage of the following assessment materials: (1) the assessment booklets 
for each activity, (2) the assessment response forms (i.e., the scoring sheets for each 
activity), and (3) any preparatory materials (e.g., reading logs) for the activities completed 
by the teachers. These materials would need to be retained for a minimum number of 
years, enough to cover the period in which teachers could appeal decisions, or to meet 
statutory requirements. 

Assessors and Their Training 

As mentioned earlier, four assessors were trained to administer the Secondary 
English Assessment. This section describes some characteristics of the assessors, describes 
the training, and presents the perceptions of the training by the assessors and FWL staff. 

Characteristics of the assessors. The four assessors trained to administer this 
assessment were all experienced high school English teachers who also had experience in 
formal writing assessment programs involving holistic scoring of writing samples (e.g., Bay 
Area Writing Project) and other language arts organizations (e.g., CLP, GATE). Two of the 
assessors had previous experience with the Secondary English Assessment and two did not. 
The two assessors who participated as assessors in the August, 1989 workshop conducted by 
the assessment developers were both males, one Caucasian, the other Asian. The two novice 
assessors were Caucasian females who had been recommended by one of the assessment 
trainers. (When administering the assessment activities, the four assessors worked in teams 
of two, each team consisting of one male and one female, and one experienced and one 
inexperienced assessor.) All four assessors were from northern California. 

Training. Although the August 1989 workshop conducted by the assessment 
developers included training of assessors, that training was limited as it did not have the 
benefit of videotapes of the activities, performance data on the teachers, and scoring data 
from the assessors. The August workshop supplied the above components, and, as a result, 
new training procedures and activities were developed. 

The two trainers for the assessment met with a FWL staff person for one day at the 
end of June to discuss the design of the new training. In addition to being asked to design 
the training so that it incorporated videotapes, performance data, and scoring data, the 
trainers were informed of the changes made to the orientation and assessment materials 
(i.e., assessment booklets), and that those changes would also need to be incorporated into 
the training. The trainers then met together during the summer and designed the training. 
They met again with the FWL staff person before the scheduled training to review their 
design package. 






The training was conducted at FWL in San Francisco on two days: Thursday, 
August 9, and Friday, August 10, 1990. The first day of training was devoted to Activity A, 
Responding to Student Writing. Because the administration of this activity did not require 
a trained assessor (i.e., the directions were self-administered), all four assessors were 
trained only to score this activity, not to administer it. Training began with the assessors 
completing Activity A as if they were candidates and then sharing and discussing their 
responses. Training continued with a discussion of the scoring criteria as listed on the 
scoring response form, and then practice in scoring teacher responses. Assessors scored 
three pairs of teachers' responses; after each pair they shared their ratings and discussed 
any differences and/or difficulties they may have had. 

On the second day of training, the assessors were divided into two groups: two 
assessors were trained by one trainer to administer and score Activity B, "Fishbowl" 
Discussion of Literary Work, and two were trained by the other trainer to administer and 
score Activity C, Speaking of Language. Training for each group began with a review of the 
activity and the activity's scoring criteria. The rest of the training for the activity consisted 
of watching videotapes of teachers participating in the activity, and then scoring the 
teachers' performances. After each round of videotapes and scoring, assessors tallied their 
responses, discussed differences, and came to a consensus. The training ended with the 
assessors reconvening as a group and discussing the overall assessment procedures for the 
three days of administration. 

Perceptions of training. When asked if the training they received was "very good," 
"adequate," or "insufficient," all four assessors responded, "very good." One assessor 
commented, "I felt the length, content, and format of the training were perfect!" Another 
assessor remarked, "Format and content seem exactly right." 

All four assessors described two aspects of the training as being the most useful: (1) 
practice in scoring sample responses, and (2) discussion of the scores and evaluation criteria. 
Other aspects of the training that assessors found useful were the use of videotapes (one 
assessor described them as providing "excitement, interest, and reality"), the practice in 
doing Activity A, Responding to Student Writing, and the "establishment of the feeling that 
each assessor's opinion was valued." 

All four assessors also had suggestions for improving the training. The suggestions 
given and the number of assessors who gave them were as follows: 






• Provide scoring guides that "would nail down the distinctions 
for each of the score points" (i.e., 4, 3, 2, and 1) 
(2 assessors) 

• Provide more specific guidelines in the format and detail of 
writing comments on the scoring response sheets 
(1 assessor) 

• Extend the training by one day or half a day 
(1 assessor) 

Based on our own observations of the training and on the performance data from the 
activities (which is discussed in the "Assessment Content" section), FWL staff agree that the 
training could be improved by following the above suggestions. In particular, the training 
should be revised to include an assessor's handbook which describes the scoring process in 
detail and specifically provides concrete examples whenever possible of (1) the distinctions 
between score points, and (2) the way in which comments are to be written on the scoring 
response sheets. By extending the training by a half day or more, both of these components 
could be addressed more thoroughly. 

Scoring 

The scoring system for the Secondary English Assessment is a holistic process which 
relies heavily on the assessor's professional judgment. Although teachers are rated along 
specific evaluation criteria described for each activity, these criteria serve solely as guides to 
help the assessor arrive at a holistic judgment for each skill or category and for the overall 
rating. 

The scoring process for this administration was conducted as follows. For Activity A, 
Responding to Student Writing, each teacher's responses were scored independently by a 
pair of assessors. The responses were scored in the afternoon, after the teachers had 
completed all of the assessment activities. For Activity B, "Fishbowl" Discussion of Literary 
Work, and Activity C, Speaking of Language, the teachers' responses were scored during and 
immediately after the activity. Each teacher was independently scored by the pair of 
assessors who administered the activity. 






After every activity, as time allowed, the pairs of assessors discussed their scores to 
determine the degree of consensus. When differences were notable, they discussed their 
reasons for their scores and tried to achieve consensus; they did not, however, change their 
original scores. 

In the afternoon, after every teacher had been scored on all three activities, FWL 
staff reviewed the comments made by the assessors on the scoring response forms and noted 
which comments were inappropriate or insufficient. Assessors whose comments were too 
subjective, for example, were instructed to write future comments as objectively as possible. 
Assessors who gave lower ratings (i.e., a "2" or "1") but did not provide comments which 
explained the ratings were asked to provide such comments in the future. 

As the scoring procedures of this assessment are part of the assessment format, more 
information about the scoring procedures and the assessor's response to these procedures is 
provided in the section of this chapter titled, "Assessment Format." 

Teacher, Assessor, and FWL Staff Perceptions of Administration 

All 19 of the participating teachers expressed satisfaction with the arrangements 
(e.g., scheduling, room arrangements) made for the administration of this assessment. 
Comments about the arrangements ranged from "Excellent!" and "Great!" to "Very easy." 
One teacher who found the arrangements to be reasonable did add, however, that 
"unfamiliarity with the city created nervousness, as did the need to get up early and process 
information." 

Although all four assessors also found the arrangements to be satisfactory, there 
were also some suggestions for improvement. One assessor suggested that the number of 
teachers assessed in Activity C, Speaking of Language, could be increased from four to six. 
Another assessor wrote: 

A list of items to remember to do, step by step, for the assessor 
would make it more consistent. Even with my own notes, I 
forgot to hand out questions once! 

And still another suggestion was to tape record the oral presentations made in Activity C to 
help answer questions about performance. 






In response to the above assessors' comments, it is our assumption that the first 
assessor's suggestion means that the two assessors who administer Activity C could just as 
easily score six teachers as four. Based on the scoring data for the activity which includes 
comments made by the assessors on the scoring response forms, FWL staff hesitate to agree 
with this suggestion. It is our belief that the scoring process for this assessment needs 
revision and it is unclear whether these revisions would make it easier or more difficult to 
score teachers' performances. 

As for the second suggestion, this information could be provided for every activity 
and should be included in the Assessor Handbook which we recommend be developed for the 
assessment. The tape recording of oral presentations could also be considered, especially to 
provide more examples during future training to represent the different ratings. 

The teachers and assessors were also asked to comment on the amount of time 
allotted for the administration of the assessment activities. Approximately two thirds of the 
teachers thought the time allotted for each of the activities was sufficient, and one third did 
not. All four assessors thought the amount of time allotted for the activity which they 
administered (i.e., Activity B or C) was sufficient. Since the timing of the activities is also a 
feature of the assessment format, this issue will be discussed more completely in the section, 
"Assessment Format." 

Assessment Content 

The content of the three assessment center activities of the Secondary English 
Assessment focuses on a teacher's skills in the following areas: 

a) responding to and analyzing a student's writing; 

b) literary interpretation and group collaboration; and 

c) oral performance with regard to issues of language and literacy in a 
multicultural society. 

The three areas were deliberately chosen by the assessment developers to represent 
aspects of competence which are not now assessed or are under-assessed during the 
credentialing of English teachers. The three areas focus primarily on competence in content 
pedagogy and subject matter knowledge, as opposed to general pedagogical competence. 
General pedagogical competence is assessed, however, by the fourth activity, the Classroom 
Portfolio, to be completed in December, 1990. 

In addition to its focus, another important aspect of the assessment content is its 
context. As was mentioned earlier, the content for each of the three activities incorporates 
a multicultural, multilingual context. For example, in Activity A, Responding to Student 
Writing, the teacher is asked to (1) read ten samples of student writing from a tenth-grade, 
multi-ethnic English class for context, and (2) respond to two samples of student writing, 
one of which is written in non-Standard English. For Activity B, "Fishbowl" Discussion of a 
Literary Work, the teacher is asked to read and discuss a short story written by an African- 
American author. Activity C, Speaking of Language, provides the teacher with a set of 
articles, taken from a variety of publications, about literature and literacy in the multi- 
cultural classroom to be read in preparation for giving an impromptu oral presentation on a 
related issue. 

In the following pages, the content of the Secondary English Assessment is discussed 
along the following dimensions: 

• Congruence with the California English/Language Arts Framework and 
Handbooks; 

• Extent of coverage of California Standards for Beginning Teachers; 

• Job-relatedness of the assessment activities; 

• Appropriateness for beginning teachers; 

• Appropriateness across different teaching contexts (e.g., grade levels, 
diverse student groups); 

• Fairness across groups of teachers (e.g., ethnic groups, gender); and 

• Appropriateness as a method of assessment. 

Congruence with the California English/Language Arts Framework and Handbooks 

FWL staff reviewed the Secondary English Assessment to see in what ways the three 
assessment center activities are congruent with California's English-Language Arts 
Framework, 1987. Because two of the assessment activities focus specifically on writing 
and literature respectively, we also looked at congruence of the assessment activities with 
California's Handbook for Planning an Effective Writing Program, 1986, and Handbook for 
Planning an Effective Literature Program, 1988. 






Table 7.2 describes the ways in which the different activities are congruent with the 
framework and handbooks. As is evident from the descriptions, all of the activities are 
congruent in some way with the framework and handbooks, but none of the activities are 
strongly congruent. Activity A, Responding to Student Writing, for example, addresses only 
one stage in the development of a student's composition skills (i.e., the teacher's evaluation 
of the student's writing). It does not assess a teacher's skill in providing various writing 
opportunities, in helping students write for various audiences, in teaching students how to 
write for a purpose, to revise and edit their writing, etc., all of which are discussed in the 
Handbook for Planning an Effective Writing Program. The activity could be made more 
congruent if it were expanded to include questions which address 
how the teacher might, in addition to evaluating the student's first writing, lead the student 
through the various stages of writing. 

Similarly, Activity B, "Fishbowl" Discussion of Literary Work, and Activity C, 
Speaking of Language, address only one component of an effective English-Language Arts 
program (i.e., modeling by the teacher of important English-Language Arts skills). 
Unfortunately, there is no guarantee that because a teacher's literary analysis or oral 
performance skills are strong, s/he can teach those skills to students. To make these 
activities more congruent would require substantial revision of the activities. Activity B, for 
example, might be revised so that the teacher views videotapes of students discussing a 
short story (which the teacher has read and then responded to in a reading log) and then is 
asked to critique the discussion along the dimensions of literary interpretation and group 
process. (If such a major revision were not deemed acceptable, then at the very least, the 
activity could be revised to include a question as to what oral language activities the teacher 
could conduct to help students better understand the meaning of the short story.) Activity 
C might be revised in a similar way: the teacher views a videotape of a student giving a 
speech and then is asked to comment on the student's content, organization, and delivery 
(which are the same criteria by which the teacher is currently evaluated when s/he gives the 
speech). Both of these revisions would result in more congruency with the state framework 
and handbooks because both would put a greater focus on a teacher's skill in responding to 
students' abilities versus demonstrating his/her own skill in activities that have little or only an 
indirect relationship to teaching students. 






TABLE 7.2 

CONGRUENCE OF THE SECONDARY ENGLISH ASSESSMENT WITH THE 
ENGLISH-LANGUAGE ARTS FRAMEWORK AND HANDBOOKS 

Framework and Handbook: English-Language Arts Framework, 1987, K-12 

   Activity A, Responding to Student Writing: 
      [illegible] development of a [illegible] writing. 

   Activity B, "Fishbowl" Discussion of Literary Work: 
      Integrates all elements of language (listening, speaking, reading, and 
      writing). Addresses the issue of modeling. Teacher models good listening, 
      valuing of ideas, and [illegible] questions. 

   Activity C, Speaking of Language: 
      Integrates all elements of language (listening, speaking, reading, and 
      writing). Addresses the issue of modeling. Teacher who can speak well 
      encourages students to use words well and to speak effectively. 

Framework and Handbook: Handbook for Planning an Effective Literature Program, 
1988, K-12 

   Activity A, Responding to Student Writing: 
      Addresses development of a student's composition skills. 

   Activity B, "Fishbowl" Discussion of Literary Work: 
      Addresses the issue of modeling (reading, writing, and listening skills). 
      Focus on central issues, interpretation of symbols, discussion of meaning, 
      and argument of interpretation. 

   Activity C, Speaking of Language: 
      Fosters awareness of society. Activity C articles address issue of 
      literature in a multi-cultural society. Models oral language skills. 

Framework and Handbook: Handbook for Planning an Effective Writing Program, 
1986, K-12 

   Activity A, Responding to Student Writing: 
      The topic selected for the students' first writing samples could be 
      considered appropriately motivating for a first draft. 

   Activity B, "Fishbowl" Discussion of Literary Work: 
      Incorporates the use of reading logs. 

   Activity C, Speaking of Language: 
      Incorporates the use of reading logs. 








Extent of Coverage of California Standards for Beginning Teachers 

The three assessment center activities of the Secondary English Assessment were 
examined by FWL staff to see how well they covered the California Beginning Teacher 
Standards which define levels of pedagogical competence and performance that California 
teacher credential candidates are expected to attain (i.e., Standards 22 to 32). The 
standards are reprinted below (in italics), along with an analysis of how the assessment 
activities correspond to each standard. 

Standard 22: Student Rapport and Classroom Environment. Each candidate 
establishes and sustains a level of student rapport and a classroom environment that 
promotes learning and equity, and that fosters mutual respect among the persons in a class. 
This standard is addressed in a small way by Activity A, Responding to Student Writing, 
which assesses a teacher's skill in responding to student writing "in a way that promotes 
confidence in future writing attempts." This standard is not addressed by Activities B and 
C. 

Standard 23: Curricular and Instructional Planning Skills. Each candidate prepares 
at least one unit plan and several lesson plans that include goals, objectives, strategies, 
activities, materials and assessment plans that are well defined and coordinated with each 
other. None of the assessment center activities addresses this standard. (It is, however, 
addressed by the Portfolio activity). 

Standard 24: Diverse and Appropriate Teaching. Each candidate prepares and uses 
instructional strategies, activities, and materials that are appropriate for students with 
diverse needs, interests and learning styles. In Activity A, Responding to Student Writing, 
the teacher is asked to respond to writing samples from two different students, both of 
whom have different needs. The standard is not directly addressed by Activity B or C; 
however, Activity C, Speaking of Language, indirectly addresses the standard through some 
of its questions which ask the teacher to explain his/her view about teaching literature in a 
multi-cultural classroom (e.g., teach "the Classics" or teach multi-cultural literature?). 

Standard 25: Student Motivation, Involvement, and Conduct. Each candidate 
motivates and sustains student interest, involvement and appropriate conduct equitably 
during a variety of class activities. The motivation aspect of this standard is somewhat 
addressed by Activity A, Responding to Student Writing, which asks the teacher to respond 
to student writing in "a way that would promote confidence in future writing attempts." 
This standard is not addressed by Activity B or C. 




Standard 26: Presentation Skills. Each candidate communicates effectively by 
presenting ideas and instructions clearly and meaningfully to students. In Activity A, 
Responding to Student Writing, the teacher is assessed on how well s/he communicates to 
students (via written language) about their writing. Activities B and C do not address this 
standard. 

Standard 27: Student Diagnosis, Achievement and Evaluation. Each candidate 
identifies students' prior attainments, achieves significant instructional objectives, and 
evaluates the achievements of the students in a class. In Activity A, Responding to Student 
Writing, the teacher is asked to evaluate samples of student writing. Activities B and C do 
not address this standard. 

Standard 28: Cognitive Outcomes of Teaching. Each candidate improves the ability 
of students in a class to evaluate information, think analytically, and reach sound 
conclusions. This standard is not addressed by Activity A. Activities B and C also do not 
directly address this standard, but it could be inferred that a teacher would be unable to 
improve the ability of students in a class to evaluate information, think analytically, and 
reach sound conclusions, unless the teacher him/herself could do so, as is required by 
Activities B and C. 

Standard 29: Affective Outcomes of Teaching. Each candidate fosters positive 
student attitudes toward the subjects learned, the students themselves, and their capacity to 
become independent learners. This standard is addressed by Activity A, Responding to 
Student Writing, which asks the teacher to respond to student writing "in a way that 
promotes confidence in future writing attempts" (i.e., the student would want to continue to 
write and would feel good about him/herself as a writer). This standard is not addressed by 
Activity B or C. 

Standard 30: Capacity to Teach Cross-Culturally. Each candidate demonstrates 
compatibility with, and ability to teach, students who are different from the candidate. The 
differences between students and the candidate should include ethnic, cultural, gender, 
linguistic and socio-economic differences. This standard is not directly addressed by 
Activities A, B, or C. Indirectly, however, each of these activities touches upon this 
standard: Activity A requires the teacher to respond to a sample of student writing which is 
written in a non-standard dialect; Activity B requires the teacher to discuss a short story 
about African-American children written by an African-American author; and Activity C 
sometimes requires (depending on the question selected) the teacher to discuss issues about 
teaching literature in multi-voiced, multi-cultural classrooms. A teacher's capacity to teach 
cross-culturally could be inferred from his/her response to the student writing sample, the 
short story, and the literature issue, but this is not directly addressed by the current scoring 
criteria. 

Standard 31: Readiness for Diverse Responsibilities. Each candidate teaches 
students of diverse ages and abilities, and assumes the responsibilities of full-time teachers. 
This standard focuses on a teacher's ability to teach classes which span the range covered 
by the credential (i.e., grades K-8 or 7-12) or students at two or more ability levels (such as 
remedial and college preparatory classes). None of the activities are designed to assess this 
ability. This standard also addresses a teacher's ability to fulfill typical responsibilities of 
teachers such as meeting school deadlines and keeping student records, none of which are 
assessed by any of the activities. 

Standard 32: Professional Obligations. Each candidate adheres to high standards of 
professional conduct, cooperates effectively with other adults in the school community, and 
develops professionally through self-assessment and collegial interactions with other 
members of the profession. This standard is not directly addressed by any of the activities, 
although Activity B, "Fishbowl" Discussion of a Literary Work, does assess a teacher's ability 
to communicate and cooperate with others in the discussion of a short story. 

The extent of coverage by the Secondary English Assessment of the California 
Beginning Teacher Standards is summarized in Table 7.3. The table lists the assessment 
center activities which address each standard, and also describes the extent of coverage 
provided. 

Job-relatedness 

The 19 teachers who participated in the three assessment center activities of the 
Secondary English Assessment were asked if the skill areas chosen for the activities (i.e., (a) 
responding to and analyzing a student's writing; (b) literary interpretation and group 
collaboration; and (c) oral performance with regard to problematic issues in English 
education) are relevant to their job of teaching. All of the teachers but one responded 
positively, some offering comments such as the following: 

Resoundingly so! 

In fact, more relevant than most districts are willing to commit 
time to. 




TABLE 7.3 

EXTENT OF COVERAGE BY THE SECONDARY ENGLISH ASSESSMENT 
OF CALIFORNIA STANDARDS FOR BEGINNING TEACHERS 

Standard                              Assessment Center Activity               Extent of 
                                      Addressing Standards                     Coverage 

22: Student Rapport and Classroom     Activity A, Responding to                Limited 
    Environment                       Student Writing 

23: Curricular and Instructional      None                                     None 
    Planning Skills 

24: Diverse and Appropriate           Activity A, Responding to                Limited 
    Teaching                          Student Writing 

25: Student Motivation,               Activity A, Responding to                Limited 
    Involvement and Conduct           Student Writing 

26: Presentation Skills               Activity A, Responding to                Limited 
                                      Student Writing 

27: Student Diagnosis,                Activity A, Responding to                Limited 
    Achievement and Evaluation        Student Writing 

28: Cognitive Outcomes of Teaching    None                                     None 

29: Affective Outcomes of Teaching    Activity A, Responding to                Limited 
                                      Student Writing 

30: Capacity to Teach                 Indirectly addressed by                  Limited 
    Crossculturally                   Activities A, B, and C 

31: Readiness for Diverse             None                                     None 
    Responsibilities 

32: Professional Obligations          Activity B, "Fishbowl"                   Limited 
                                      Discussion of Literary Work 






A few teachers, however, qualified their "yes" answers by stating or suggesting that 
Activity C, Speaking of Language was not relevant. Commented one teacher, 

Except for Activity C. Interesting content and issues; however, 
how can this impromptu [speech] on issues assess oral 
performance in the classroom? Delivering such information and 
to colleagues (not students) is different than speaking to 
students. 

Another teacher remarked, 

I don't exactly see where Activity C fits into our daily activities, 
beyond the political question of being able to "defend" or lobby 
particular education theories. 

This latter remark was somewhat echoed by the one teacher who responded 
negatively to the question of job relevancy. This teacher found Activity C to be irrelevant 
because, "I don't deal with problematic issues except those I can control in my classroom." 

Of the four assessors, all found the three assessment center activities to be relevant 
to a teacher's job. One assessor emphatically wrote: 

Decidedly relevant! I would be very hesitant to hire a teacher 
who lacked the skills to do at least passably well on all of these 
activities. The abilities to analyze and comment on student 
writing, discuss literature, and to explain one's pedagogical 
philosophy and practice are crucial to teachers. 

Three of the assessors, however, judged Activity A, Responding to Student Writing to 
be the "most relevant" because, in the words of one assessor, it is "closer to actual classroom 
performance." Elaborating on the merits of Activity A, an assessor commented, 

[Activity A] assesses one of the most important and pervasive 
activities of English teachers, and the two essays are as diverse 
as possible: black/white, female/male, extremely/barely literate, 
and diverse in style. Responses to both give a good picture of a 
candidate's knowledge of both pedagogy and subject matter in 
the teaching of writing. 




FWL staff agree with most of the above comments: the three assessment activities 
do seem relevant to a teacher's job, Activity C, Speaking of Language, does seem the least 
relevant, and Activity A, Responding to Student Writing, the most. However, with regard to 
Activity B, "Fishbowl" Discussion of Literary Work, and Activity C, Speaking of Language, 
we also tend to agree with a teacher who acknowledged the indirect job relevancy of the 
activities, but who added, 

They in no way would assess our actual job of teaching a class. 

In other words, although Activities B and C are indirectly relevant to an English 
teacher's job, and, it could be argued, part of a teacher's job, they do not in any way assess 
how a teacher responds to and teaches students, which is the teacher's job. 

Appropriateness for Beginning Teachers 

The appropriateness of the three assessment center activities of the Secondary 
English Assessment is discussed in this section from two perspectives: (1) the perceptions 
of the participating teachers and assessors, and (2) the teachers' performance on the 
assessment. 

Perceptions. When asked if they had sufficient opportunity to acquire the knowledge 
and skills relevant to the activities in which they participated, the teachers' responses were 
mixed: 63% (12 of 19) said "yes," 21% (4 of 19) said "no," and 16% (3 of 19) said "yes and 
no." Several of the teachers who marked "yes," however, qualified their answer. These 
teachers said that they had sufficient opportunity to acquire the knowledge and skills 
relevant to the activities, but only because of their experience in the classroom. For 
example, one teacher wrote, 

Yes, but mainly because I've had one year's experience already. 
Without that I don't think I would have done as well. 

Another teacher elaborated further, commenting both on her experience in the classroom 
and her lack of training at the university: 

Personally, I had no preparation for the assessment of my skills 
in responding to student writing during my university course 
work in teacher training. Now, having taught for two years, 
I've developed my philosophy and approach with help from 
teachers I respect. But straight out of teacher training, this 
[activity] would have been assessing something I hadn't been 
taught. 

Some of the teachers who responded to the question with both a "yes" and "no" 
answer also cited lack of training or exposure to the content of the activities as the reason 
for their dual answer. In explanation of the "no" part of her answer, one teacher 
commented on her lack of exposure to one part of the content of Activity C, Speaking of 
Language: 

The only reason I marked "no" is because I personally have not 
been exposed to the current debate: canon vs. multicultural 
literature. 

Similarly, another teacher explained her dual answer as follows: 

No, because my department, school and district spend almost 
no time dealing with these kinds of activities. Yes, because of 
my own interests. 

Finally, the majority of the teachers who responded "no" to the question also referred 
to their lack of experience in the classroom and lack of training in the skill areas assessed 
by the activities. Remarked one teacher, 

As a new teacher, I think I need more experience in these three 
areas to be considered "skilled." A college degree doesn't 
necessarily give me the knowledge to teach nor does a teacher 
preparation program. I think a lot of knowledge and skill 
comes from experience. 

Two other teachers with "no" answers zeroed in on their lack of experience/training in a 
particular assessment area: 

I need practice in the process of collaborative thinking and 
cooperative groups [Activity B]. I have a general 
understanding, but lack experience in which to draw 
information from. I was taught to be an authority which is 
inconsistent with collaborative learning. 




I feel my responses to student writing [Activity A] and my 
ability to be clear about issues in articles [Activity C] need lots 
of work. 

The four assessors were also asked if they thought that a beginning English teacher 
would have had an opportunity to acquire the knowledge and skills needed to respond to 
each activity in an adequate manner. None of the assessors gave an unqualified "yes"; three 
qualified their responses, and one gave a dual answer much like that of some of the 
teachers. The major qualification linking all of the assessors' answers was that of the 
quality of the teacher's preparation program. One assessor remarked, 

They should have had opportunity to develop these skills and 
knowledge-if they haven't, I think their deficiencies are an 
indictment of teacher training programs. Teachers should be 
trained to respond to student work, analyze literature, and read 
and synthesize research. 

Said another assessor more succinctly, 

Yes, if they have gone through a good teacher training program. 
No, if they haven't. 

Offered another assessor, 

Some teacher training programs obviously don't require the 
knowledge and skills necessary, but that's the fault of the 
program, not the assessment. Maybe one use of [the 
assessment] is to evaluate these programs. 

In conclusion, while a slight majority of the teachers believe they have had the 
opportunity to acquire the skills and knowledge measured by the assessment activities, 
many of the teachers and all of the assessors are not so sure. The dissenting teachers and 
those expressing uncertainty explained their answers by citing lack of experience in the 
classroom and/or a lack of training in the activities' skill areas. The assessors justified their 
answers by focusing on the possibility of a teacher not having had a good teacher 
preparation program (i.e., one that trained the teacher to respond to student work, analyze 
literature, and read and synthesize research). 




Performance on assessment. FWL staff analyzed the teachers' performance on each 
of the three assessment activities to see if the beginning teachers participating in this 
assessment had acquired the knowledge and skills measured by these activities. Specifically, 
FWL staff looked at the teachers' overall ratings for each activity, as well as the 
corresponding skill or category ratings. Because each teacher was rated by two assessors, 
the ratings from both assessors were included in the analysis. Although the rating scale 
included four possible ratings, ranging from a high of "4" to a low of "1", the ratings were 
not designed with pass/fail characteristics. For our purposes, however, we interpreted the 
"4" and "3" ratings (4 = definite strengths in this area; 3 = some strengths in this area) as 
"pass" ratings, and the "2" and "1" ratings (2 = lacks strength in this area; 1 = serious 
weaknesses in this area) as "fail" ratings. 

Table 7.4 shows the number of teachers receiving each rating in the evaluation 
categories (including "overall") for each activity. In the first activity, Activity A, Responding 
to Student Writing, approximately 63% (12 of 19) of the teachers clearly passed (i.e., they 
received an overall rating of "3" or "4" from two assessors), and 16% (3 of 19) clearly did not 
pass (i.e., they received an overall rating of "2" from two assessors). The remaining 21% (4 
of 19) of teachers were given a "2" rating by one assessor, and a "3" rating by another. None 
of the teachers received an overall rating of "1." 
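
The pass/fail reading of the two assessors' ratings described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure FWL staff actually used: the function name `classify_ratings`, the outcome labels, and the use of `None` for a missing rating are all introduced here for the example, and every pairing of one pass rating with one fail rating is simply labeled "mixed."

```python
from typing import Optional

PASS_RATINGS = {3, 4}  # 4 = definite strengths; 3 = some strengths
FAIL_RATINGS = {1, 2}  # 2 = lacks strength; 1 = serious weaknesses


def classify_ratings(first: Optional[int], second: Optional[int]) -> str:
    """Combine two assessors' ratings into one outcome (hypothetical helper)."""
    if first is None or second is None:
        return "missing rating"  # no clear outcome without both ratings
    for rating in (first, second):
        if rating not in PASS_RATINGS | FAIL_RATINGS:
            raise ValueError(f"rating must be 1-4, got {rating!r}")
    if first in PASS_RATINGS and second in PASS_RATINGS:
        return "clearly passed"
    if first in FAIL_RATINGS and second in FAIL_RATINGS:
        return "clearly did not pass"
    return "mixed"  # one pass rating and one fail rating
```

Under this sketch, a teacher rated "3" by one assessor and "2" by the other falls in the "mixed" group, which is how the remaining 21% of teachers in Activity A are described.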

In the second activity, Activity B, "Fishbowl" Discussion of Literary Work, 89% (17 of 
19) of the teachers clearly passed, and none of the teachers clearly failed. Two teachers 
received mixed ratings: one teacher received a "3" and a "2"; another teacher received a "4" 
from one assessor and no rating from the other assessor. None of the teachers received an 
overall rating of "1." 

In Activity C, Speaking of Language, 68% (13 of 19) of the teachers clearly passed, 
16% (3 of 19) clearly did not pass, and 16% received a mixed rating (i.e. a "3" and a "2"). 
None of the teachers received an overall rating of "1." 

Overall, then, at least half of the teachers clearly passed each activity, with the 
greatest number of teachers passing Activity B, "Fishbowl" Discussion of Literary Work. 
Moreover, approximately 79% (15 of 19) of the teachers clearly passed at least two of the 
three activities, with 42% (8 of 19) of the teachers passing all three activities. 
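
The percentages quoted above follow directly from the overall counts for the 19 participating teachers. A short Python check might look like the following; the names `OVERALL_COUNTS` and `percent` are hypothetical, and the counts are taken from the text, with the one Activity B teacher who lacked a second rating kept in a separate group.

```python
TOTAL_TEACHERS = 19

# Overall outcomes per activity, as reported in the text (counts of teachers).
OVERALL_COUNTS = {
    "A": {"clearly passed": 12, "mixed": 4, "clearly did not pass": 3, "missing rating": 0},
    "B": {"clearly passed": 17, "mixed": 1, "clearly did not pass": 0, "missing rating": 1},
    "C": {"clearly passed": 13, "mixed": 3, "clearly did not pass": 3, "missing rating": 0},
}


def percent(count: int, total: int = TOTAL_TEACHERS) -> int:
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


for activity, counts in OVERALL_COUNTS.items():
    # Every teacher should be counted exactly once per activity.
    assert sum(counts.values()) == TOTAL_TEACHERS
    print(activity, {label: f"{percent(n)}%" for label, n in counts.items()})
```

The same helper can be applied to the cross-activity figures, for example `percent(15)` for the teachers who clearly passed at least two activities and `percent(8)` for those who passed all three.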






TABLE 7.4 

THE NUMBER OF TEACHERS RECEIVING EACH RATING IN THE 
EVALUATION CATEGORIES FOR EACH ACTIVITY 

ACTIVITY A, RESPONDING TO STUDENT WRITING 

Evaluation Categories      Clearly Passed   Mixed     Clearly Did   Other 
                                            Ratings   Not Pass 

Response Strategies        1    6    4      6         2 
Analysis of Writer         0    6    3      3         3             4 (missing 
  to Text                                                           a rating) 
Overall                    1   10    1      4         3 

ACTIVITY B, "FISHBOWL" DISCUSSION OF LITERARY WORK 

Evaluation Categories      Clearly Passed   Mixed     Clearly Did   Other 
                                            Ratings   Not Pass 

Interpretative Process     9    5    1      1         1             2 ("2" & "4") 
Group Process              5    8    5      1         0 
Overall                    8    7    2      1         0             1 (missing 
                                                                    rating) 

ACTIVITY C, SPEAKING OF LANGUAGE 

Evaluation Categories      Clearly Passed   Mixed     Clearly Did   Other 
                                            Ratings   Not Pass 

Content                    6    4    3      3         3 
Plan                       3    7    3      2         3             1 (missing 
                                                                    rating) 
Delivery                   5    7    5      2         0 
Overall                    4    6    3      3         3 





The performance data discussed above seem to support the teachers' perceptions 
that they have had the opportunity to acquire the knowledge and skills measured by the 
assessment activities, at least for Activity A, Responding to Student Writing, and Activity C, 
Speaking of Language. Approximately 63% of the teachers indicated that they have had the 
opportunity to acquire the necessary skills and knowledge to pass the activities, and 63% 
and 68% of the teachers, respectively, clearly passed Activities A and C. The teachers did 
much better on Activity B, "Fishbowl" Discussion of Literary Work, with over three-fourths 
of the teachers clearly passing. This marked increase in performance is not surprising, 
however, since some form of literary analysis is usually taught beginning in the 7th or 8th 
grade (if not before), and talking about books with others is also often an integral part of 
the curriculum beginning about then. 

The performance data also support the teachers' responses to the question of which 
of the assessment activities, if any, could be passed after student teaching and before 
teaching a classroom of their own. Approximately 89% of the teachers (17 of 19) named 
Activity B, "Fishbowl" Discussion of Literary Work. Activity A, Responding to Student 
Writing, and Activity C, Speaking of Language, were each named by fewer than 50% of the teachers 
(8 and 7 respectively). Similarly, when asked which of the activities, if any, could only be 
passed by teachers with more than two years' experience in the classroom, five teachers each 
(or 26%) named Activity A and Activity C, while only two teachers (11%) named Activity B. 

In conclusion, the analysis of the teachers' performances on the three activities 
seems to suggest that Activity B, "Fishbowl" Discussion of Literary Work, may be more 
appropriate for beginning teachers than Activity A, Responding to Student Writing, and 
Activity C, Speaking of Language, primarily because beginning teachers seem to have had 
greater opportunity to acquire the knowledge and skills measured by Activity B. 

Appropriateness across Contexts 

In order to determine if the teachers and assessors believe the three assessment 
center activities of the Secondary English Assessment are appropriate for teachers across 
contexts, we specifically asked them to comment on the assessment's appropriateness for 
teachers of diverse student groups (e.g., different student ability levels, different ethnic 
groups, handicapped or limited-English students, different school/community settings). 
Approximately 89% (17 of 19) of the teachers responded positively to the question; one 
teacher disagreed, and one teacher was undecided. Of the teachers who responded 
positively, one teacher affirmed, 






They are skills that are needed no matter the particular 
circumstance of the teacher. 

The teacher with the dissenting opinion remarked that "only Activity A was 
appropriate," because the other two activities put the teacher in the position of the student 
(e.g., discussing literature in a group) rather than teaching the student. 

The following sections discuss the issue in more detail with respect to teachers of 
different grade levels and of diverse student groups (e.g., ethnic diversity). 

Grade level. In this assessment pilot test, none of the teachers or assessors made 
any reference to the inappropriateness of the assessment for teachers at different grade 
levels. 

Analysis of the rating results, however, indicates that there may be some differences 
among teachers of different grade levels according to the activity. For example, of the three 
teachers who clearly did not pass (i.e., received a "2" rating from two assessors) Activity A, 
Responding to Student Writing, all taught at the junior high/middle school level. One other 
middle school teacher received a "3" and a "2" rating for this activity. Thus, of the seven 
junior high/middle school teachers participating in Activity A, more than half did not clearly 
pass. 

In Activity B, "Fishbowl" Discussion of Literary Work, no teacher clearly did not 
pass, and one high school teacher received a "3" and a "2" rating. In Activity C, Speaking of 
Language, three teachers clearly did not pass, two of whom were middle school teachers. In 
addition, three teachers received a "3" and a "2" rating, two of whom were middle school 
teachers. 

Thus, the performance data on the teachers seem to suggest that junior high/middle 
school teachers may be less well prepared than senior high teachers for Activity A and 
Activity C. (It is hard to imagine why this is so, however, unless, in general, the more 
skilled secondary teachers gravitate toward and are hired at the high school level.) 

Diverse students. As mentioned above, a clear majority of the teachers believe the 
assessment is appropriate for teachers of diverse student groups. One teacher added, 
however, 






The classics would be impossible to teach ESL students. The 
story could be told, but the reading could not actually happen. 

The teachers' belief that the assessment is appropriate for teachers of diverse 
student groups is strengthened by the fact that they are all teachers of diverse student 
groups. For example, all of the teachers who participated in the assessment taught in 
classrooms where at least some students spoke a language besides English. In addition, 
more than 50% (11 of 19) of the teachers taught in classrooms where four or more 
languages were spoken. 

The assessors were also asked to address the issue of student diversity, but in a 
slightly different way. The assessors were asked to comment on how the assessment activities 
address a beginning English teacher's ability to work with diverse students. All of the 
assessors agreed that Activity A, Responding to Student Writing, does a good job of 
addressing this issue. Commented one assessor, 

Activity A definitely addresses the candidate's ability to 
effectively communicate with students of very different 
backgrounds and writing abilities. 

The assessors were more mixed in their comments about Activity B, "Fishbowl" 
Discussion of Literary Work, and Activity C, Speaking of Language. Three of the assessors 
tended to agree with the following remark made by one assessor about Activity B: 

Activity B can reveal something about the teacher's abilities in 
this area if the story used for the activity is one that requires 
cross-cultural knowledge to be understood, as is the case with 
"The Lesson." 

The assessors agreed that "The Lesson," the story used in this pilot test, does a good job of 
raising issues relevant to cultural diversity. One assessor, however, disagreed with the 
other three, stating that Activity B is not a suitable way of assessing a teacher's ability to 
work with diverse students. 

Activity C, Speaking of Language, was also perceived by three of the assessors as 
being able to "give some indication of a teacher's ability to work with diverse students" 
because of the focus of the readings and the topic choices for the oral presentations. 






Activity C was still considered less suitable than Activity A, Responding to Student Writing, 
however, because, in the words of the following assessor, 



Activity C indicates theoretical grounding in the issues, but 
Activity A shows the actual practical reaction to "nonstandard" 
papers. 

Our consultant on cultural diversity, Sharon Nelson-Barber, brought up another 
perspective regarding the assessment's appropriateness for teachers of diverse student 
groups. Barber examined some of the teachers' responses to Activity A and was struck by 
one teacher's analytical comment about the student's text which in effect said that the 
student had the potential of "straying away from the topic" in future writings. Barber stated 
that this comment "suggests a lack of knowledge about/experience with students who employ 
different rhetorical strategies." That is, research shows that there are considerable 
organizational differences between white children's oral narratives and those of certain 
racial/ethnic groups. Black students, for example, tend to be more episodic and white 
students more "topic-centered" in their oral narratives (Michaels, S. and Cook-Gumperz, 
1979). Native American students have also been documented as sometimes exhibiting 
different organizational patterns in their narratives (Colley, R. and Lujan, P., 1982). 

As the assessors did not respond to the above teacher's comment, we do not know if 
they know the information presented by Barber and chose not to react to the teacher's 
comment, or if they are unfamiliar with the information. An assessor familiar with the 
research might rate a teacher lower than an assessor unfamiliar with the research. Thus, in 
order to be a fair assessment for teachers of diverse student groups, it seems that the 
assessors would have to be familiar with the current research on students of different 
racial/ethnic groups as it pertains to the topic of the assessment activity (e.g., student 
writing). 

Fairness across Groups of Teachers 

When asked if they felt the assessment is fair to new teachers of both genders, 
different ethnic groups, different language groups, and other groups of new teachers, the 
majority of teachers and assessors said "yes." Explained one teacher, 

No matter your background, you should be able to do these 
things. 






Only two teachers responded negatively, one of whom commented, 

I know of a young man who is Hmong and studying to be a 
teacher. I think Activity C might be difficult for him. I think 
these different cultures need to be represented. 

Another teacher, who chose not to answer the question with a "yes" or "no," agreed with the 
above teacher, remarking that the "oral language area might prove difficult to ESL 
speakers." 

One of the assessors also perceived the assessment to be biased towards fluent 
English speakers, but stated, 

Fluency in English is required, but that doesn't strike me as 
unreasonable. 

Finally, another factor of the assessment was pointed out by an assessor who noted that 
"obviously a candidate whose hearing or sight was impaired would require an adapted 
assessment." 

Aside from these somewhat obvious factors, our consultant on cultural diversity, 
Sharon Nelson-Barber, brought up two other issues. First, she points out that, in Activity 
A, the teachers are directed not to mark every error in the student's writing "unless that is 
your practice on such occasions." She acknowledges that the intent of this direction might 
be "to discourage the teachers from feeling compelled to mark errors simply to demonstrate 
they see them," but she wonders if a teacher who elects to frequently mark errors might 
receive a lower rating than a teacher who does not. In fact, she cites the following 
assessor's comment on one of the teacher's response forms as an example of probable bias 
against teachers who focus on mechanics: "Response is heavily focused on mechanics. 
Likely to undermine the warmth of the response." 

In response to this assessor, Barber reminds us of the following: 

Many black teachers view the teaching of skills as essential to 
their students' survival--that moving ahead to mastery of 
mainstream language means practice with skills. Thus, 
grammar, punctuation, spelling etc. are precisely the features 
some black teachers are likely to highlight in the assessment. 




The other issue Barber addresses pertains to Activity B, "Fishbowl" Discussion of 
Literary Work. In this activity, teachers are expected to discuss a short story in a small 
group format. Barber reminds us that, "as was discerned during the Stanford Teacher 
Assessment Project's (TAP) small group discussion exercises, not all group members 
participate equally in group discussions, even when they know their patterns of interaction 
will be noted and rated." It is therefore important that assessors are aware of differences in 
language use across groups, as well as non-verbal communicative cues and interactive styles 
(e.g., some participants take the role of leaders, others of followers). Such knowledge is 
especially important when the group members are of different cultures, explains Barber, for 
"it is all too easy for a person to feel 'something is wrong' in interactions with people of a 
different background without really knowing what is causing this feeling." 

Thus, once again it seems that the fairness of this assessment for teachers of 
different groups could depend heavily on the knowledge of the assessors. 

Our analysis of the scoring results as they pertain to different groups of teachers 
indicates that females tended to receive higher overall ratings than the males for Activity A, 
Responding to Student Writing, and Activity B, "Fishbowl" Discussion of Literary Work, and 
lower overall ratings for Activity C, Speaking of Language. In Activity C, for example, all 
three teachers who clearly did not pass were females (it should be noted, however, that 
twice as many females as males participated in the pilot test). 

Further analysis of the scoring results reveals that teachers who described 
themselves as teaching in suburban locations tended to receive higher ratings for all of the 
evaluation categories for all three activities. Our very small sample of minority teachers (3) 
tended to receive higher ratings than the non-minority teachers on Activity B, "Fishbowl" 
Discussion of Literary Work, lower ratings on Activity A, Responding to Student Writing, 
and a mixture of lower and higher ratings for Activity C, Speaking of Language. (For more 
information on trends of differences between teachers with different characteristics, see the 
section, "Technical Quality.") 

Appropriateness as a Method of Assessment 

In addition to evaluating the appropriateness of the Secondary English Assessment 
for beginning teachers, and its appropriateness across contexts and groups of teachers, the 
teachers and assessors were asked to evaluate the appropriateness of the method of 
assessment, and to compare it with other methods of assessment which they have 
experienced. 






Appropriateness. The teachers were asked if they thought the three assessment 
center activities are an appropriate way of assessing (1) general teaching skills, and (2) 
skills in teaching English classes. Approximately 58% (11 of 19) of the teachers responded 
positively to the first question, and 74% (14 of 19) to the second. Many of those teachers 
who did not think the assessment center activities are an appropriate way to assess general 
teaching skills defended their answers by saying that the activities did not require any 
teaching. Some comments illustrative of their viewpoint are as follows: 

You can't assess someone's teaching skills unless you see them 
teach. I feel like a failure in this, yet in the classroom I am 
developing confidence, and have been told that I am a good 
teacher by my peers who have seen me teach. 

I think the written evaluation is good for English assessment, 
but teaching is not only knowing your content area, discussing 
a book with friends, or talking in front of a polite group. It is 
interaction with students. A teacher needs to know how to 
guide a discussion, keep students focused, deal with the 
interruptions, and alter plans when things fall apart. I think 
knowing the philosophies advocated in the articles helps, but 
training for war is not the same as being in the middle of an 
ambush. 

This point of view was also shared by some of the teachers who did not think the 
activities were an appropriate way of assessing skills in teaching English classes either: 

We were not asked to TEACH anything... Our interpretation of 
"The Lesson" may reveal how well we can read and analyze 
literature, but our discussion doesn't necessarily prove we can 
teach interpretation skills. 

The second exercise [Activity B, "Fishbowl" Discussion] placed 
us in a student's role, not a teaching role (as if we were back in 
college). The third exercise [Activity C, Speaking of Language] 
is much like a speech contest. No teaching is involved...not at 
all actual lesson related. 






Even one of the teachers who responded positively to both questions, added a qualification, 
saying, 

Except in the case of Activity C, Speaking of Language. What 
did you intend to assess? Speaking skills? Teaching delivery? 
I'm not sure it adequately assesses my ability to deliver a lesson 
to a student. 

Finally, from a teacher who answered the questions with a "yes" and a "no": 

Best way to judge someone's teaching effectiveness and the 
ability to teach students is to actually see it. Many educators 
have all the correct terms and such but choke in front of the room. 

The assessors were also asked to comment on the appropriateness of the assessment 
as a way to assess general teaching skills and skills in teaching English classes. All four 
stated that the assessment activities were an appropriate way of assessing skills in teaching 
English classes. As with the teachers, however, there was more uncertainty as to whether the 
activities were an appropriate way of assessing general teaching skills. Only two of the 
assessors responded with a definite "yes"; one gave a qualified "yes," and the other 
commented as follows: 

I'm not completely sure that a person might not teach well 
despite everything else...unless his performance on all three 
[activities] were abysmal. 

In conclusion, while a majority of teachers and all of the assessors think the three 
assessment center activities are an appropriate way of assessing skills in teaching English 
classes, many of the teachers and some of the assessors reject the notion that these 
activities are an appropriate way of assessing general teaching skills. In particular, the 
teachers faulted the assessment center activities for not assessing a teacher actually 
teaching anything. 

Comparison of activities with other assessments. All of the teachers were asked to 
compare the three assessment center activities of the Secondary English Assessment with 
other assessments with which they have been evaluated (e.g., multiple-choice exams such as 
CBEST and NTE Specialty Area Tests, classroom observations during student teaching). 
Of the 18 teachers who responded to the question, 72% (13 of 18) commented favorably 
about the assessment, many stating that the assessment is better than the NTE and/or 
CBEST tests. Some of the teachers gave high praise to the assessment because, unlike 
other assessments, it provided an opportunity to learn: 



This has been the most delightful and helpful of the 
assessments I've gone through. This method is a learning 
opportunity in itself (in dialogue with other new teachers, etc.). 
I think the CBEST is absurd as a realistic method of assessing 
a teacher's capabilities. 

I think this assessment has far more value than any of the 
above-mentioned techniques [because] we are all learning and 
developing as we are being assessed. 

Other teachers praised the assessment activities because they were performance- 
related: 

Are you kidding? You couldn't even compare an assessment 
with this much individual attention and hands-on performance 
with traditional pen and paper tests. 

CBEST is useless. This assessment judges who I am more-it's 
more personal. Can I speak well? Can I write well? How do I 
sound in front of a group? Can I communicate? CBEST 
assesses none of this. 

A few teachers, while not commenting negatively about the assessment or making 
unfavorable comparisons, still championed classroom observations as the best way to assess 
teaching competency for the obvious reason that they assess a teacher actually teaching: 

Classroom observation and student feedback are the two most 
realistic ways of assessing teaching competency...this assessment is 
all related to teaching...but I don't know if it assesses a 
person's teaching skills. 

I feel the best way to evaluate a teacher is in front of the 
classroom so you can actually assess the reactions and responses 
to the students. 

Finally, one teacher best summed up the comparison of the assessment activities 
with CBEST, NTE and classroom observations as follows: 

No comparison, CBEST and NTE are like playing the trivial 
pursuits literature game. I think that classroom observation 
during student teaching is extremely important, but doesn't 
always reflect the teacher's thinking about teaching. 

Assessment Format 

The format of the Secondary English Assessment has a dual nature: there is the 
format of the assessment as a whole (i.e., the assessment as a single entity), and there are 
the distinctly different formats corresponding to each activity. In this section, the formats 
of the three assessment center activities are sometimes discussed separately and sometimes 
together, depending on the focus of the analysis. In discussing the assessment's preparation 
materials, for example, the assessment is primarily discussed as a single entity. When 
discussing the clarity of the assessment's rating forms, however, each of the three activities 
is looked at separately. The format section is primarily based on the comments of the 
teachers and assessors, as well as the perceptions of FWL staff. 

Clarity of the Teachers' Preparation Materials 

In preparation for the three assessment center activities of the Secondary English 
Assessment, each teacher received an Orientation Handbook sent in advance of the 
assessment. This handbook included a section on each assessment activity, with each 
section including the following elements: 

- overview 

- evaluation criteria 

- preparation activities 

- sample instructions 

- evaluation response form 

- preparation materials 

When asked how thoroughly they read the handbook, 89% (17 of 19) of the teachers 
said they read it carefully. One teacher admitted to skimming it, and another read some 
parts carefully and skimmed others. All but one of the teachers said that the assessment 
activities, the aspects of teaching being evaluated, and the preparation activities were 
described clearly in the handbook. The one dissenting teacher thought the assessment 
activities and the preparation activities were described clearly, but that the aspects of 
teaching being evaluated were not. 

The majority of teachers (12 of 19) were also satisfied with the information 
presented in the handbook. Teachers praised the handbook for being "very complete," 
"clearly organized," and "easy to read." One teacher remarked, 

The format was excellent and left little room for 
misunderstanding. 

In addition to praise for the handbook, there were suggestions for improvement. These 
suggestions ranged from "use a metal spiral binding instead of a plastic one" to "do not 
include all the teacher evaluation forms-they detract from the purpose." 
Several teachers commented that the section on Activity C, Speaking of Language needed 
improvement. One teacher thought the instructions for the activity were repeated; another 
teacher felt the instructions were not explicit enough (i.e., this teacher thought the 
directions should specify that the teacher should prepare ahead of time an outline for a 
response to all of the questions which may be asked at the assessment center). Still another 
teacher wrote, 

Activity C was confusing because there was so much to read 
and interpret. I found myself less clear about what the issues 
are at the end of the reading. 

The above teacher's comment about Activity C is an important one, and is discussed 
further below. As for the other suggested improvements, FWL staff do not agree that the 
teacher evaluation forms (i.e., the response forms on which the assessors rate the teacher's 
performance) should be eliminated. The inclusion of these forms provides the teachers with 
the exact criteria by which their performance is judged on each activity. FWL staff 
acknowledge, however, that because of the way the handbook was organized, some of the 
instructions for the activities were repeated, and it is possible this redundancy could be 
eliminated or reduced. 

After being asked their opinions about the handbook, the teachers were asked if they 
had any difficulties with the preparatory work described in the handbook. Approximately 
50% (9 of 19) of the teachers answered "yes," the majority of them (6 of 9) citing difficulty 
with the preparatory work required for Activity C. Specifically, the difficulties described by 
the teachers all related to the set of five articles which they were required to read in 
preparation for their impromptu presentation at the assessment center. These articles were 
described by some of the teachers as "dry," "difficult to read," and "of little interest." One 
teacher remarked, 

Some articles were jargon-laden and a little difficult. I had to 
reread parts to understand them clearly. 

Other teachers commented on a specific article with which they had difficulty. In all, three 
of the five articles were singled out by at least one teacher as being dry or difficult to 
understand. 

(There were also teachers, however, who felt just the opposite about the articles, as 
evidenced by the following comment: 

I thought the articles in the reading (Roomer, Hirsch, et al.) 
were terrific-they made me think, get inspired. I plan to Xerox 
Roomer's article for my colleagues at school.) 

The other difficulties experienced by teachers related to the reading log required in 
preparation for Activity B, "Fishbowl" Discussion of Literary Work. The three teachers who 
experienced difficulties with this activity all expressed uncertainty as to what they should 
write in the log. Said one teacher, 

I was a little confused about what we were to do for Activity B. 
I wasn't really sure what was expected of me in my reading log. 

Another teacher described a lack of experience in keeping a reading log, and thus had 
trouble writing the minimum amount (i.e., one typewritten page). 

After commenting on the handbook and the difficulties they experienced with the 
preparatory activities, the teachers were asked if they had any other comments about the 
preparatory work required for the assessment. While two teachers cited a shortage of time 
to complete all the work, and other teachers repeated some of the problems described above, 
a little over one-fourth of the teachers commented favorably about the preparatory process, 
most of them referring indirectly to the preparation work required for Activity C, Speaking 
of Language. A sample of these comments follows: 

It was important to have the preparation book because it 
allowed time for processing. If I had to evaluate and assess an 
article on the spot, my success would be lower than when I had 
proper time to prepare, think, and process information. 

The preparatory work was excellent to prepare me for the 
assignment. 

It's a good idea to give preparatory work because research is a 
new experience and the preparation work prepared me. It was 
like reading a play before going to see it in the theater. 

This last teacher's comment is particularly noteworthy as it offers a possible 
explanation as to why some of the teachers had a difficult time reading some of the articles 
for Activity C. Reading research articles does not seem to be a common activity of 
beginning teachers-nor probably of teachers in general. Thus, the language used in the 
articles may be intimidating to some teachers or at the very least be unfamiliar. In fact, 
this set of articles was (1) compiled by several high school English teachers, and (2) selected 
over another set of articles for inclusion in the handbook because these articles were 
deemed to be more readable and interesting! While FWL staff acknowledges that some of 
the articles (e.g., Hirsch's article) are written in what could be called educational 
researchese, we believe the content of these articles is very important and beneficial for 
teachers to read. 

In summation, while the majority of teachers read the Orientation Handbook 
carefully and were satisfied with the information presented, several teachers suggested 
improving the section on Activity C, Speaking of Language. In addition, almost one third of 
the teachers expressed having difficulty with the preparatory work required for Activity C. 
The difficulty they cited most often was that all or some of the articles were difficult to 
understand or not very interesting--a difficulty which may be the result of a lack of 
experience in reading research articles. Several teachers also expressed difficulty with 
preparing the reading log for Activity B, "Fishbowl" Discussion of Literary Work. These 
teachers expressed confusion as to what they should include in their logs. Finally, despite 
some teachers' difficulties with the preparatory work required for the assessment, there 
were other teachers who appreciated and praised the work for preparing them for the 
assessment. 



Appropriateness of Time Allotted for Each Activity 



Each of the three assessment center activities was allotted approximately 50 
minutes. Each activity, however, utilized this time in a very different way. Activity A, 
Responding to Student Writing, is broken into two parts. In the first part, the teacher 
reads and responds to the two student writing samples. In the second part, the teacher 
evaluates/analyzes the two writing samples on separate forms. For this activity, it was 
suggested that the teacher allot approximately 10 minutes per essay in part one, and 10 
minutes per essay in part two. (Obviously, this gives the teacher an extra 10 minutes to be 
used however necessary.) In Activity B, "Fishbowl" Discussion of Literary Work, 
approximately 40 minutes are allotted to discuss the short story, and the remaining time is 
used by the teachers to write a brief summary of any revised insights into the story or 
observations about the group process that they may have after the discussion (these 
summaries are written in the assessment booklets). In Activity C, Speaking of Language, all 
of the teachers are given 10 minutes to prepare their oral presentation, approximately 5 
minutes to give their presentation, and approximately 2 minutes to answer a follow-up 
question posed after their presentation. Much of the remaining time is taken by the 
teachers drawing their topic of presentation from a hat. 
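
The time budget described above can be checked with a short sketch. The minute values come from the paragraph; the variable and function names are illustrative only, and the Activity C calculation rests on two assumptions the report does not state: that all teachers prepare during the same 10 minutes and then present one after another, and that the group size is supplied by the reader (four is used here purely as an example).

```python
# Minute values come from the paragraph above; all names are illustrative.
ACTIVITY_MINUTES = 50  # approximate allotment for each activity

# Activity A: two essays, about 10 minutes each, in each of two parts.
activity_a_suggested = 2 * 10 * 2
activity_a_slack = ACTIVITY_MINUTES - activity_a_suggested

# Activity B: about 40 minutes of discussion, remainder for the written summary.
activity_b_summary = ACTIVITY_MINUTES - 40


def activity_c_remaining(group_size: int) -> int:
    """Minutes left after preparation, presentations, and follow-up questions.

    Assumes (not stated in the report) that all teachers prepare during the
    same 10 minutes and then present one after another.
    """
    prepare, present, follow_up = 10, 5, 2
    return ACTIVITY_MINUTES - prepare - group_size * (present + follow_up)


print(activity_a_suggested, activity_a_slack)  # 40 10
print(activity_b_summary)                      # 10
print(activity_c_remaining(4))                 # 12, for a hypothetical group of four
```

Under these assumptions, Activity A leaves about 10 minutes of slack and Activity B about 10 minutes for the written summary. The time left over in Activity C depends on the group size.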

Both the teachers and the assessors were asked if they thought the time allotted for 
each activity was sufficient, too long, or not long enough. Approximately 58% (11 of 19) of 
the teachers thought the time allotted for each activity was sufficient, 32% (6 of 19) said it 
was not long enough, and 10% (2 of 19) said it varied according to the activity. None of the 
teachers said the time allotted was too long for any of the activities. The number of 
teachers specifying each activity as requiring more time is as follows: 

                                                      Requires More Time 

Activity A, Responding to Student Writing                    (4) 
Activity B, "Fishbowl" Discussion of Literary Work           (1) 
Activity C, Speaking of Language                             (3) 

In fact, five of the nineteen teachers did not finish Activity A, Responding to Student 
Writing in the time allotted. One of the four teachers who wanted more time for the 
activity explained: 



I felt hurried to read, respond thoughtfully, and answer all 
questions on Activity A. Although I recognize the need for 
speed in responding to a class' worth of writing (30+ students), 
I felt the questions and the situation created a need for more 
time. 

One teacher who did not finish, however, considered the one-hour time limit to be 
sufficient because "teachers will realistically not be able to spend 15 minutes responding to 
each student's paper." Nevertheless, this teacher also admitted that "for teacher assessment 
purposes, I was unable to write all I wanted to show all that I was thinking." 

Of the three teachers who wanted more time for Activity C, Speaking of Language, 
one called the activity "deep," and said it required "much more 
comprehension/analysis/synthesis" than the other activities. Another teacher commented, 

In Activity C, preparing a speech in 10 minutes was quite 
difficult-and delivering it in 5 was nearly impossible with any 
references to the readings. I may have saved my nerves if I had 
prepared an outline for each topic at home. 

The third teacher who advocated more time for Activity C also suggested the idea of 
preparing an outline for each topic prior to the activity, especially as an alternative to 
allotting more time to prepare the speech at the assessment center. 

Finally, the one teacher who wanted more time for Activity B, "Fishbowl" Discussion 
of Literary Work remarked, 

I would have appreciated more time for Activity B because just 
about the time we were comfortable with each other our time 
was up. 

As for the assessors, because they did not administer two of the three activities, they 
were only asked to comment on the time allotted for the activity they did administer (i.e., 
Activity B or C). All four assessors thought the time for their activity was sufficient. One 
assessor explained why the time allotted was particularly suitable for Activity B, "Fishbowl" 
Discussion of Literary Work: 



The discussions seemed to peak about the 30-minute mark. 
However, sometimes the "extra" 10 minutes allowed candidates 
who had started slow to recover. Groups in which one or two 
members are an impediment to the group process benefit from 
the full 40 minutes [of discussion]. Quieter, less assertive people 
seem to need some extra time to figure out how to cope with 
more vocal and assertive but less insightful candidates. 

Not all assessors perceived the discussions as peaking at 30 minutes, however. In 
fact, the assessor who was paired with the assessor quoted above remarked that "some 
groups were surprised that 40 minutes had passed so quickly when time was called." 
Nevertheless, as was discussed in the section, "Fairness Across Groups of Teachers," 
teachers differ in the ways they choose to articulate their skills and knowledge in a group 
discussion, and thus enough time needs to be allowed in the activity for these differences to 
be identified. 

Two assessors also offered comments on the time allotted for Activity A, Responding 
to Student Writing. Both assessors seemed to be responding to the fact that some teachers 
did not finish the activity and/or gave brief answers to some of the questions. One assessor 
suggested that the teachers "be given a time warning 1/2 way so that they address both 
papers equally." The other assessor commented, 

For some candidates with brief responses on Part II, I wasn't 
sure if they ran out of time or didn't have much to say...Maybe 
more time should be allowed for the activity. 

Based on our observations of the activities, and on the teachers' performance on 
each activity, FWL staff tend to agree with the teachers and assessors who consider the 
time allotted for Activities B and C to be sufficient. In response to those teachers who 
advocated directing the teachers to construct ahead of time an outline for each topic 
question in Activity C, Speaking of Language in order to be prepared at the assessment 
center, we agree that this is a possibility to be considered. However, because the activity is 
designed to measure a teacher's skill in delivering "impromptu" oral presentations--of the 
sort that might be given at teachers' or parents' meetings in response to audience questions-- 
it seems that directing teachers to prepare ahead of time for the presentation (or giving 
them more time at the assessment center to prepare) would somewhat invalidate the 
"impromptu" nature of the activity. (In fact, for a truly impromptu presentation, 
consideration should be given to not providing the topic questions ahead of time with the 
set of research articles as this allows teachers to prepare an outline for each topic question 
if they choose.) 

In response to the concerns raised by some of the teachers and assessors about the 
time allotted for Activity A, Responding to Student Writing, FWL staff notes that seven of 
the 19 teachers did not clearly pass the activity, and four teachers thought more time should 
be allotted for the activity. FWL staff recommends that consideration be given to 
extending the time for the activity by at least 10 to 15 minutes. Although we agree that 
teachers in practice do not have the luxury of unlimited time when responding to student 
writing, we think teachers should have adequate time when evaluating student writing for 
colleagues. Thus, the time allotted for the first part of the activity (i.e., responding to 
student writing) could remain the same, while more time could be added to the second part 
(i.e., evaluating the student text). 

Clarity of the Rating Forms and Process 

The rating process for the three assessment center activities was described briefly in 
the introduction and in the "Scoring" section. To recap, using a four-point scale, the 
assessors holistically rated the teachers' performances for each activity on three levels: (1) 
according to specific criteria listed under an evaluation category, (2) the evaluation category, 
and (3) the overall level. Each of the assessors was asked about their experience in rating 
the teachers for Activity A, Responding to Student Writing and for whichever activity they 
administered (i.e., Activity B, "Fishbowl" Discussion of Literary Work or Activity C, 
Speaking of Language). Their responses are discussed below. 

Activity A, Responding to Student Writing. The assessors were first asked if they 
had any difficulties evaluating (a) Part I-the teacher's responses to the student writing, and 
(b) Part II-the teacher's analysis of the student text. The two new assessors said they had 
difficulties evaluating both parts. Of the two experienced assessors, one had difficulties 
evaluating Part I, and the other had difficulties evaluating Part II. Thus, three-fourths of 
the assessors had difficulties evaluating each part of the activity. 

When asked to describe their difficulties and make suggestions for improvement, the 
three assessors who had trouble with Part I all referred to the very first part of the activity 
which requires the teacher to select, from eight possibilities, their purpose(s) in responding 
to the student writing samples. These possibilities are as follows: 



a) Establish myself as a "friendly" audience 

b) Inform the writer of problems for future work 

c) Inform the writer of strengths s/he demonstrates 

d) Demonstrate the primary criteria I'll be using to evaluate student 
writers, such as the importance of correctness, principles of 
organization, use of details, or other concerns 

e) Guide revision of this particular piece of writing 

f) Establish myself as an authority on good writing 

g) Establish myself as an interested reader 

h) Other (please explain) 

The teachers are told in the directions for this part of the activity that their responses to 
the student writing samples will be evaluated in the context of their stated purposes. This, 
however, tended to present a problem for the assessors. 

One assessor, for example, had difficulty rating teachers who marked purposes that 
she felt were inappropriate for the context of the assessment (i.e., the first writing 
assignment of the year): 

The context obviously called for few to no corrections but 
candidates were allowed to choose that approach. Though it's 
poor pedagogy [on their part], I felt it unfair to penalize them. 

Another assessor said that, "based on candidate's choice of purpose," she was "sometimes not 
certain" how to rate the teacher's performance on two of the five criteria listed on the 
response form for Part I. A third assessor remarked, 

Some of the purposes seem distinctly easier to carry out for 
new teachers (e.g., "friendly audience"). 

Although only three of the assessors reported difficulty with Part I, all four assessors 
offered some suggestions for improving Part I. These suggestions were as follows: 

• Include an item on the response form that allows the assessor to 
indicate whether or not the teacher's comments to the student are 
consistent with his/her purpose. 



• To provide easy reference for the assessor, have a place on the response 
form that indicates what the teacher's purposes are. 

• Improve the symmetry between the evaluation criteria listed on the 
response form and the list of purposes. 

• Combine the purposes (a) and (g) into one statement. 

In view of the problems experienced by the assessors, the suggestions made, and the 
earlier comments by Sharon Nelson-Barber which note the likelihood of different teachers 
choosing different purposes, it seems imperative that this part of the assessment (i.e., 
"Purposes in Responding") be revised. Although the "Purposes in Responding" seems to 
have been designed with the purpose of helping establish a "context" particular to each 
teacher against which evaluators might more accurately judge the teachers' comments, the 
assessment's design did not go far enough so that this purpose could actually be realized. 
FWL staff strongly recommend that the assessors' first three suggestions made above be 
followed, and that all of the purposes listed as well as the scored responses from this year's 
pilot test be reviewed and discussed to address the question of whether teachers who mark 
some purposes over others tend to get higher ratings. 

For Part II, all three assessors who reported difficulties evaluating the teacher's 
analysis of the student writing samples described the same difficulty. The difficulty was 
evaluating the teacher's performance according to two of the five criteria listed on the 
response form. These two criteria--"B. Identifies effective features of the text," and "C. 
Identifies problematic features of the text"--actually require the assessor to rate five 
different dimensions each: content, structure, development, style/voice, and mechanics. 
The difficulties in rating came from these detailed criteria. Explained an assessor: 

Part II, B and C were the most difficult to rate and come to an 
agreement on. Assigning a number value for each item was difficult 
since many of these items affect each other or overlap. 

In response to this problem, all three assessors suggested the same solution, 
described by one assessor as follows: 

Instead of responding in such detail-respond for category [and] 
leave listed features for assessors to comment on. 



In other words, for the two criteria, the ratings along each of the five dimensions would be 
eliminated; instead, the assessors would give one general rating for each criterion, and 
would use the five dimensions as guides to arrive at each rating. 

Despite the difficulties mentioned above, none of the assessors had difficulty giving 
an overall rating to the teacher's performance on Activity A, Responding to Student 
Writing. One of the two new assessors who had difficulties evaluating both Part I and II, 
explained why she had no difficulty giving an overall rating: 

If a student's paper was handled with intelligence and interest, 
taking into account the context, it was easy to weigh that heavier 
than any flaws in analysis for peers unless the errors or omissions 
were gross. 

Judging from her comment, it would appear that this assessor weighted Part I's 
rating(s) more heavily than Part II's. Although assessors were not instructed to do this, 
they were also not instructed not to do this. It would be interesting to know if the other 
assessors used the same or a similar process in arriving at their overall ratings. 
Interestingly, interrater reliability was higher for the overall rating of this activity than 
for any of the other activities. But if this process was used, it raises some questions: Is this 
weighting desirable? If not, is it unavoidable? Since there are only two subsets, how does 
one arrive at a holistic rating if one subset is rated lower than the other? These questions 
cannot be answered here, but in revising the rating form and process for Activity A, 
Responding to Student Writing, they should be explored. 

Activity B, "Fishbowl" Discussion of Literary Work. Because only two of the four 
assessors scored this activity, only they were asked to answer questions about the scoring of 
this activity. 

Of the two assessors, neither had difficulty evaluating the teacher's responses during 
the activity, and neither had difficulty evaluating the teacher's group process skills. One of 
the assessors, however, expressed difficulty with evaluating the teacher's interpretive skills. 
This assessor focused on two problems. The first was with the criterion, "Offers thoughtful 
and sound interpretive insights," one of five criteria listed under the evaluation category, 
Interpretive Process. The assessor described the problem as follows: 



"Offers thoughtful and sound interpretive insights" - this phrase 
still puzzles me. I can disagree with an interpretation that I find 
thoughtful. I don't see how I could disagree with an interpretation 
I considered sound. So what do I do with a thoughtful 
interpretation I disagree with? 

The second problem, as described below by the assessor, is less clearly defined, but 
raises an important question worth being considered: 

I am also struck by the mullers and the formulators, those who put 
their finger on ambiguity and those who resolve it. Does the 
assessment reward glib formulators more than patient, tenacious 
mullers? I don't know. 

Indeed, perhaps the question is not whether the assessment rewards different 
styles of interpretation, but whether the assessors do. That is, when giving a rating, does the 
assessor consciously or unconsciously favor a particular style of interpretation or way of 
working in a group (i.e., group process skills)? Looking at the ratings again for Activity B, 
the answer would seem to be yes--at least for interpreting group process skills. Under the 
evaluation category, Group Process, six of the 19 teachers received two different ratings 
(e.g., a "3" and "4," or a "2" and "3") from the two assessors, indicating the likelihood that the 
two assessors were operating with different biases. 
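
As a rough illustration of the figure cited above, the sketch below converts the count of split ratings into a simple exact-agreement rate. This is plain percent agreement, not necessarily the interrater reliability statistic used elsewhere in this report, and the function name is hypothetical.

```python
def exact_agreement(n_rated: int, n_split: int) -> float:
    """Share of teachers who received the same rating from both assessors.

    n_rated: number of teachers rated by the pair of assessors.
    n_split: number of teachers whose two ratings differed.
    """
    if n_rated <= 0 or not 0 <= n_split <= n_rated:
        raise ValueError("counts must satisfy 0 <= n_split <= n_rated, n_rated > 0")
    return (n_rated - n_split) / n_rated


# Group Process category, Activity B: six of the 19 teachers had split ratings.
rate = exact_agreement(19, 6)
print(f"{rate:.0%}")  # 68%
```

In other words, the two assessors gave identical Group Process ratings to 13 of the 19 teachers. The examples of split ratings cited above differ by one scale point.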

As with Activity A, the assessors for Activity B did not express any difficulties in 
giving an overall rating for the activity. One of the assessors did note, however, that the 
"overall rating of 3 covered a wide range." 

Both assessors were also asked how frequently they used the teacher's log and the 
summary notes written at the end of the activity to aid in the evaluation of the teacher's 
interpretive skills. One assessor never used either item, but commented, 

I think these could be very important for marginal candidates or for 
resolving discrepancies between assessors' scores. 

The other assessor reported using both items "for some teachers," and described the 
items as "very important" because, 



When I was unsure if a candidate read and understood "The 
Lesson," I could check the log to verify both comprehension and 
interpretation. 

Two other comments were made related to the rating process for Activity B. One 
referred to the space on the response form reserved for comments. The assessors generally 
used this space to write down notes as the teachers were discussing the short story. These 
notes, however, were often not understandable to anyone who had not observed the 
activity. One assessor commented on the difficulty she had writing understandable notes, 
and offered a suggestion for improvement: 

I had some difficulty in organizing my notes into comments that 
could be understood by a reader who had not observed the activity. 

Perhaps notes and comments should be written on a separate piece 
of paper. Divide the paper into two columns-one for notes taken 
during the activity, the second for commentary that would explain 
the relationship of the notes to the ratings. 

Assuming that it is important for the ratings to be supported by some sort of legible 
and understandable evidence, especially when the rating is a negative one, FWL staff 
believes this assessor's suggestion should be strongly considered. At the very least, any 
future training for assessors should include instruction on how to write their comments in 
an appropriate manner. 

Finally, an important recommendation was made by one of the assessors, possibly as 
a result of her experience on the last day of the pilot test. On that day, because there was 
only one group of four teachers being assessed, the two assessors of Activity B, "Fishbowl" 
Discussion of Literary Work were able to view the administration of Activity C, Speaking of 
Language after they had finished and scored their activity. After both activities were 
completed, FWL staff heard the assessors commenting on the differences in the teachers' 
performances in the two activities (i.e., Activity B and C). Thus, FWL staff agrees with the 
following assessor's recommendation and contention: 

Continue to have separate assessors for Activities B and C. If I had 
observed candidates' participation in C, it could have influenced my 
ratings for B. 



Activity C, Speaking of Language. Of the two assessors who administered and 
scored this activity, the experienced assessor reported no difficulties with evaluating the 
teacher's performance for this activity, while the new assessor experienced two difficulties. 

First, the new assessor experienced difficulty evaluating the teacher's responses 
during the activity (i.e., while the teacher was giving his/her oral presentation). As she 
explained, 

It was difficult to attend to content and take notes (as in college 
lectures). 

The assessor also added that the difficulty was worse for some presentations than 
others because some of the presentations did not correspond as well to the scoring criteria 
listed on the response form for the activity. 

This assessor also experienced difficulty evaluating the teacher's skills in planning 
the presentation. In particular, the assessor had difficulty rating the teacher on two of the 
five criteria listed under the evaluation category, Plan of Presentation. These two criteria, 
"Communicates clear central idea or question" and "Clarifies issues with analysis or 
reasoning," were especially difficult to rate, said the assessor, under the following 
circumstances: 

If a teacher's ideas-however clear or well analyzed-did not fit the 
research or take into account the totality of the issue (e.g., "canon 
vs. multi-cultural"), I didn't know how to assess them. 

The assessor seems to be saying that if the teacher's presentation communicated a 
clear central idea or question and clarified issues with analysis or reasoning, but did not 
reflect the set of research articles to be read in preparation for the activity or perhaps only 
addressed a small part of a very large issue, then the assessor was not sure how to rate the 
teacher on the criteria named above. Or, in other words, how should an assessor rate a 
teacher who gives a good presentation, but does not really address the issue? This question 
is not addressed in the present training design, but should be in the future. 

Although both assessors said they had no difficulty giving an overall rating for the 
teacher, it should be noted that six of the 19 teachers received two different overall ratings 
(e.g., a "3" and "4" or a "3" and "2") from the two assessors for this activity. In the comment 
below, the experienced assessor makes an important observation about the scoring 
differences between him and his fellow assessor: 

We consistently agreed on our evaluations of the candidates' 
performances; the only time we had a major disagreement on score, 
we discovered, upon discussion, that we had both seen the same 
shortcomings in the candidate's presentation, but disagreed on 
how much these shortcomings should lower his score. 

Thus, because the scoring is a holistic process rather than one anchored to samples of 
performances, different reactions to the same data can result in different scores. Perhaps if 
more examples were provided in the training to illustrate ratings, scoring differences could 
be reduced. 

When asked how frequently they used the teacher's reading log to aid in their 
evaluations of the teacher's content and organization skills, both assessors said they used it 
"for some teachers." Commented one assessor, 

For those whose presentations were less than very strong, the log 
served as an additional source of information that might help the 
candidate's score. 

The other assessor also valued the reading log and suggested that it be rated under 
the subset, Plan of Presentation. 

One other comment was made by an assessor about the format of Activity C. The 
assessor suggested adding one or two articles about adolescent literature to the set of 
articles in the handbook for the activity. The addition of the articles would serve two 
purposes: (1) new questions could be created to serve as topics of presentations, and (2) the 
addition of more questions would make it possible for each teacher to draw two questions, 
selecting one and discarding the other. As the original assessment activity included 
questions about adolescent literature (but no corresponding articles), FWL staff think the 
addition of such articles and questions would enhance the activity, as would the revised 
format of allowing each teacher to select one question after drawing two. 






In conclusion, revisions need to be made to the rating process and forms for all of 
the three assessment center activities, but particularly to the process and form for Activity 
A, Responding to Student Writing. In addition, based on the assessors' comments and the 
scoring results, consideration needs to be given to greatly reducing the probability of 
assessors interpreting the same data differently. That is, there needs to be more consensus 
as to the kind of performance that is represented by each point on the rating scale. 
Providing the assessors with a well-written scoring manual with numerous examples during 
training could help address this problem. Training and a scoring manual could also address 
the way in which assessors should write their comments to support the ratings they make 
for each activity. 

Cost Analysis 

Administration and Scoring Cost Estimates 

The Secondary English Assessment Activities are administered in an assessment 
center format. The current structure of the activities is such that four candidates can be 
administered the three activities in a half-day session using four assessors. Thus, it requires 
approximately one half-day of assessor time per half-day assessment for each candidate. An 
additional hour for preparation and finalizing an assessment is needed for each assessor. 
Using a rate of $20/hour yields an estimate of $100/assessment for administration and 
scoring of the three activities. 

Training for this assessment was two days. Future assessments would require at 
least this amount of training and the training could be extended to three days. If we 
assume that each assessor would conduct 30 assessments each year for five years, we could 
distribute the costs for training an assessor over 150 assessments. Reimbursing assessors for 
three days training at $160/day or $20/hour would cost $480. Distributing the $480 over the 
150 assessments adds approximately $3/assessment for training. 

Other costs include those associated with telephone, duplication, postage, and travel 
where needed. Travel could be expensive in California unless regional assessments were 
used. A regional assessment would minimize travel costs. Estimating costs for these 
activities or ingredients would depend in large part on the manner in which the system was 
ultimately designed and how costs were apportioned. Using a figure of $30 per assessment 
for these activities would assume only minimal travel costs, based on our experience from 
the pilot testing. This is the same estimate that was used in the First Year Report on Pilot 
Testing. 




These result in the following cost estimates to administer and score the three 
Secondary English Activities in a half-day assessment center format: 

Assessor Costs:         $100/assessment 
Training Costs:           $3/assessment 
Other Costs:             $30/assessment 

Total Admin/Scoring:    $133/assessment 
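
The per-assessment estimate above is simple arithmetic, and a short sketch makes the assumptions easy to vary. The dollar figures come from the text; the four-hour half-day is an assumption inferred from the $100 assessor cost at $20/hour, and all function and variable names are illustrative only.

```python
# Sketch of the per-assessment administration and scoring estimate.
# Figures are taken from the text unless marked as an assumption.
HOURLY_RATE = 20                  # dollars per assessor hour
HALF_DAY_HOURS = 4                # assumption: a half-day session is 4 hours
PREP_HOURS = 1                    # preparation and finalizing, per assessment
TRAINING_DAYS = 3                 # extended training length considered in the text
TRAINING_DAY_RATE = 160           # dollars per training day
ASSESSMENTS_PER_ASSESSOR = 30 * 5 # 30 assessments a year for five years
OTHER_COSTS = 30                  # telephone, duplication, postage, minimal travel


def cost_per_assessment():
    """Return the cost components and their total, in dollars per assessment."""
    assessor = HOURLY_RATE * (HALF_DAY_HOURS + PREP_HOURS)
    training = TRAINING_DAYS * TRAINING_DAY_RATE / ASSESSMENTS_PER_ASSESSOR
    total = assessor + training + OTHER_COSTS
    return {"assessor": assessor, "training": training,
            "other": OTHER_COSTS, "total": total}


if __name__ == "__main__":
    for name, dollars in cost_per_assessment().items():
        print(f"{name:>9}: ${dollars:.2f}")
```

Under these assumptions the training share is $3.20 per assessment, which the report rounds to $3 to reach its $133 total.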

Development and Pilot Testing Costs 

The costs for developing the four activities for this assessment were $84,415 and are 
broken out by Cost Category in Table 7.5, which also includes costs for pilot testing. These 
development costs are the expenses for the assessment developer to deliver drafts for these 
activities to the CTC and SDE. The developer was building on prior work with these 
assessment activities and approaches; thus, future development costs would be more similar 
to these than would be the case if a new development effort were initiated. Additionally, 
approximately $45,429 was incurred for the pilot testing of these assessments with 19 teachers. 

These provide samples of developmental costs that should be considered if a similar 
assessment were to be adapted for implementation. 

Technical Quality 

This section discusses the technical issues related to the three assessment center 
activities of the Secondary English Assessment: development, reliability, and validity. 

Development 

Although this assessment was developed during the period of May 17 to December 
29, 1989, two important sources of information contributed to the preliminary design stages: 
(1) an August 1987 California State University workgroup, and (2) the piloting of the 
English 677 course at San Francisco State University during the spring semester of 1988. 
The workgroup created a comprehensive list of desired competencies for prospective English 
teachers and proposed some alternative plans for assessing these competencies. The 
experimental English 677 course, titled "Performance in English," assessed selected 
performance abilities of credential candidates in English. Of the approximately 15 
assessment activities piloted in the course, three were selected for inclusion in the 
Secondary English Assessment developed for the California SDE/CTC. 

TABLE 7.5 

DEVELOPMENTAL AND PILOT TEST COSTS FOR THE 
SECONDARY ENGLISH ASSESSMENT 

Cost Categories                            Development*    Pilot Testing 

Staff Salaries & Benefits                  [illegible]     [illegible] 
Consultants (Teachers, assessors,          [one entry partly legible as 10,49_; 
  and other consultants)                    column unclear] 
Travel (Consultants and staff)                   5,103           4,044 
Other Direct Costs (Site rental,                 4,000           2,085 
  phone, duplication) 
Total Direct Costs                             $67,532         $35,912 
Indirect Costs                                  16,883           9,517 
Total Costs                                    $84,415         $45,429 

*These costs are those for developing the three activities pilot tested and reported here 
and a portfolio to be pilot tested this fall-winter. Pilot test costs are those for pilot 
testing these three activities. 

After the identification of the assessment activities, three groups of participants were 
identified and recruited to review the activities and/or to develop assessment materials and 
procedures for the activities. These three groups were: (1) five expert English educators 
involved in teacher training institutions, (2) eight veteran English teachers, and (3) 16 new 
teacher candidates who acted as subjects of the assessment activities and assisted in revising 
and refining the assessment package. 

Upon completion of draft materials and procedures for each activity, a pilot test was 
conducted in an August 1989 workshop. The participants in the pilot test were the eight 
veteran English teachers and the 16 new teacher candidates. Based on the results of this 
workshop, revisions were made to the assessment materials and procedures. Further 
revisions were also made throughout the fall by the assessment developer and two of the 
veteran English teachers, and final revisions were made after a December 1989 meeting of 
the eight veteran English teachers. 

Reliability 

The following analyses were performed on the pilot test data of 19 teachers. 
Interrater agreements were examined to assess the degree to which assessors were able to 
consistently judge candidates using the English-Language Arts Assessment scoring 
protocols. Internal consistency estimates were generated to assess the degree to which the 
variables or factors within each of the activities would form a measure and the degree to 
which the different activities related to each other and might form an overall assessment of 
a candidate. 

Interrater agreements. The first measure of agreement among judges was obtained 
by comparing the number and percent of ratings in which assessors gave identical or 
different ratings. Figure 7.2 presents the percent of exact agreements for Activities A, B 
and C. They range from a low of 53 percent for Activity A's Overall Response Strategies to 
a high of 89 percent for that activity's Overall Rating. The only variable on which raters 
differed by more than one point was Activity B's Interpretive Process, where two of 19 
teachers received ratings two points apart. This level of agreement on the three activities 
suggests that a fairly high degree of agreement was achieved through the training and scoring 
associated with the pilot test. 
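
The exact-agreement figures described above can be illustrated with a minimal sketch. It assumes each rater's scores are held as a list of integers aligned by teacher; the function name and the sample ratings are hypothetical and are not the pilot test data.

```python
def agreement_summary(rater_a, rater_b):
    """Summarize agreement between two raters who scored the same teachers.

    Returns the percent of identical ratings, the percent differing by
    exactly one point, and the largest difference observed.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same, non-empty set")
    diffs = [abs(a - b) for a, b in zip(rater_a, rater_b)]
    n = len(diffs)
    return {
        "exact_pct": 100.0 * sum(d == 0 for d in diffs) / n,
        "adjacent_pct": 100.0 * sum(d == 1 for d in diffs) / n,
        "max_diff": max(diffs),
    }


if __name__ == "__main__":
    # Hypothetical ratings on a four-point scale, for illustration only.
    first = [3, 4, 2, 3]
    second = [3, 3, 2, 1]
    print(agreement_summary(first, second))
```

Reporting the largest difference alongside the exact-agreement percent mirrors the text's note that only one variable had ratings two points apart.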




FIGURE 7.2 
Percent Agreement Between Raters for the 
Secondary English Assessment Activities 

[Bar chart. Vertical axis: Percent of Teachers. Bars by evaluation category: 
Activity A (Response, Text, Overall A); Activity B (Interpret, Group, Overall B); 
Activity C (Content, Plan, Delivery, Overall C).] 



Interrater correlations. Correlations between raters also serve as an estimate of 
interrater agreement. The correlations among rater pairs are displayed in Table 7.6. 
Correlations were calculated for each variable on each of the activities and for the Overall 
Rating on each activity. For each variable there are two measures. The first, which is 
labeled Rating (RT), is the holistic rating given for the variable. The second, which is labeled 
Summed (S), was obtained by summing the individual items for that variable to form a score. 
Two rater pairs rated Activity A: Responding to Student Writing; three rater pairs rated 
Activity C: Speaking of Language. The average correlations across rater pairs were 
calculated and are presented for these activities. For Activity B: Fishbowl Discussion of 
Literary Work, a single rater pair rated all 19 teacher candidates and this correlation is 
reported. 

Using the average across rater pairs, the interrater correlations range from 
Activity A: Responding to Student Writing's Overall Response Strategy, where the 
correlation is zero (0), to Activity C: Speaking of Language's Presentation Plan, where the 
average correlation was .96 for the Holistic Rating. There was no particular pattern for the 
Holistic Ratings and Summed Ratings to have higher or lower agreements among the raters. 
Thus, it appears that using only the Holistic Rating for each variable or omitting the 
Holistic Ratings for the subparts of each activity and summing the individual ratings are 
both viable approaches in terms of the degree of agreement that will be observed between 
raters. Holistic ratings could reduce the rating time. 

The variability of these correlations reflects both random fluctuations due to the 
small numbers of teachers rated and a need to further refine and develop the rating system. 
Given the draft status of this assessment, these results suggest that the assessment and 
scoring systems could be developed to yield reasonable agreements among raters on these 
types of tasks or assessment. 
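
The interrater correlations discussed above can be sketched as follows. The report does not state how the averages across rater pairs were formed; as an assumption, this sketch averages on the Fisher z scale, a common convention. With the three Activity C Content holistic correlations listed in Table 7.6 (.65, .85, .91) it gives about .83, which matches the tabled average, though that agreement does not establish the method actually used. Function names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two raters' scores for the same teachers.

    Returns None when either rater gave every teacher the same score,
    because the correlation is undefined in that case.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def average_r(correlations):
    """Average correlations on the Fisher z scale (assumed method).

    Requires every |r| < 1; a correlation of exactly 1.00 has no finite
    z value and would need separate handling.
    """
    z_values = [math.atanh(r) for r in correlations]
    return math.tanh(sum(z_values) / len(z_values))


if __name__ == "__main__":
    print(round(average_r([0.65, 0.85, 0.91]), 2))
```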

The rating system allows the raters to rate NA on those items for which they 
judge insufficient information was available to make a rating. Examining the degree to 
which rater pairs observing the same candidates rated the same items as NA also provides a 
measure of rater agreement. Across all items for the three activities, 3[?].3 percent of the 
ratings had one rater but not the other rate an item as NA. Thus, for about one third of 
the ratings, one assessor but not the other judged there was not sufficient evidence on which 
to make a rating. An implication is that the training and ratings would be strengthened by 
using additional tapes of teacher candidates to provide examples on which raters can rate 
and discuss ratings, including what constitutes sufficient evidence for making a rating. 
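
The NA comparison described above amounts to counting item ratings where exactly one of the two raters marked NA. A minimal sketch, assuming NA is stored as None and ratings are aligned item by item; the function name and sample values are hypothetical.

```python
def na_disagreement_rate(ratings_a, ratings_b):
    """Percent of item ratings where one rater marked NA and the other did not.

    NA is represented by None. Items that both raters scored, or that both
    marked NA, count as agreement on whether evidence was sufficient.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same, non-empty item set")
    mismatched = sum((a is None) != (b is None)
                     for a, b in zip(ratings_a, ratings_b))
    return 100.0 * mismatched / len(ratings_a)


if __name__ == "__main__":
    # Hypothetical item ratings, for illustration only.
    rater_one = [3, None, 4, None, 2, 3]
    rater_two = [3, 2, None, None, 2, 4]
    print(f"{na_disagreement_rate(rater_one, rater_two):.1f} percent")
```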






TABLE 7.6 

CORRELATIONS BETWEEN RATERS FOR THE SECONDARY ENGLISH ASSESSMENT 
ACTIVITIES FOR HOLISTIC RATING (RT) AND SUMMED RATINGS (S) 

                               Type of          Rater Pair            Averaged 
Activity/Part                  Rating        1       2       3      Pair Ratings 

Activity A, Responding to Student Writing 
  Response Strategies          RT            0       0                    0 
                               S            .07     .26                  .16 
  Analysis of Writer & Text    RT           NA      .65                  NA 
                               S            .22     .49                  .36 
  Overall Activity A           RT           [illegible] 
  N's                                       11      [illegible] 

Activity B, "Fishbowl" Discussion of Literary Work 
  Interpretative Process       RT           .48 
                               S            .55 
  Group Process                RT           .53 
                               S            .17 
  Overall Activity B           RT           [illegible] 
  N                                         19 

Activity C, Speaking of Language 
  Content                      RT           .65     .85     .91          .83 
                               S            .64     .50     .87          .71 
  Plan                         RT           .56    1.00    1.00          .96 
                               S            .58     .71    1.00          .88 
  Delivery                     RT           .51     [illegible]  .82     .59 
                               S            .51     [illegible]  .82     .59 
  Overall Activity C           RT           [illegible]  .59     .73     .74 
  N's                                       11       4       4 








Internal consistency of the tasks and assessment. Coefficient alpha reliability 
estimates were calculated for the three different activities and their subparts by using the 
individual ratings on items within each subpart. The reliabilities for the activities and 
subparts are listed below: 

Activity/Subpart                             Reliability 
                                          Subpart    Total 

A: Responding to Student Writing *
     Response Strategy 
     Analy. Writer & Text 
     [one value, .77, is legible for Activity A; its row is unclear] 

B: Fishbowl Discussion                                 .90 
     Interpretive Process                   .86 
     Group Process                          .85 

C: Speaking of Language                                .91 
     Content                                .74 
     Presentation Plan                      .83 
     Delivery                               .78 

* Indicates that insufficient ratings were made due to the number 
of NA ratings. 

These estimates indicate a relatively high degree of internal consistency within the 
subparts and total activity evaluations/ratings. A review of the data suggests that raters tend 
to assign the same rating to items within an activity and to the overall rating for the 
activity. An implication is that it might be possible to have raters simply assign ratings to 
the different subparts and not take the time to rate individual items within subparts. This 
would allow for providing some feedback on candidates' strengths and weaknesses within 
each of the activities and lessen the time for ratings from what is required if individual 
ratings are made for each item. The reliabilities for each activity suggest that the activities 
do form a measure in which an overall judgment or evaluation can meaningfully be made. If 
there were low or no internal consistency within the activities or their subparts, then it 
would call into question what is being measured and would undermine the interpretability 
of any composite or overall evaluation for the activities. 
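
Coefficient alpha, the internal consistency estimate used in this section, can be sketched with the standard formula: alpha = k/(k-1) x (1 - sum of item variances / variance of the summed score), where k is the number of items. The data layout (one list of scores per item, aligned by candidate), the function name, and the sample scores are assumptions for illustration; they are not the pilot test ratings.

```python
from statistics import variance


def coefficient_alpha(items):
    """Cronbach's coefficient alpha for a set of item ratings.

    `items` holds one list per item; position i in every list is the
    rating given to candidate i. Sample variances are used throughout.
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    n = len(items[0])
    if n < 2 or any(len(item) != n for item in items):
        raise ValueError("every item needs ratings for the same candidates")
    totals = [sum(item[i] for item in items) for i in range(n)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("summed scores do not vary; alpha is undefined")
    item_var_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


if __name__ == "__main__":
    # Hypothetical ratings: three items, four candidates.
    sample = [[2, 3, 4, 3], [3, 3, 4, 2], [2, 4, 4, 3]]
    print(round(coefficient_alpha(sample), 2))
```

Ratings marked NA would have to be dropped or otherwise handled before this calculation, which is consistent with the report's note that some estimates could not be made.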



Intercorrelations among activities. Correlations among the three activities were 
calculated for the 19 teacher candidates and are reported below. 

                                                A       B       C 

Activity A: Responding to Student Writing       - 
Activity B: Fishbowl Discussion                .26      - 
Activity C: Speaking of Language               .58     .36      - 



Given the relatively small N of 19, only the .58 correlation between Activities A and C 
is statistically significant. However, these correlations provide some support that although 
the activities are related there will not be as strong a relationship across activities as 
there is consistency within each activity. Further support for this is provided when 
coefficient alpha is calculated as a measure of reliability across all activities. For the 19 
candidates in the pilot test the internal consistency across activities was .67. This indicates 
some internal consistency across all activities but also provides tentative evidence that the 
activities measure somewhat different attributes of the teacher candidates' performance. 
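
The significance statement above can be checked with the usual t test for a correlation, t = r x sqrt(N - 2) / sqrt(1 - r^2) with N - 2 degrees of freedom. The report does not name the test or the significance level it used; the sketch assumes a two-tailed test at the .05 level, for which the tabled critical t with 17 degrees of freedom is 2.110. Names are illustrative.

```python
import math

# Two-tailed .05 critical value of t for 17 degrees of freedom
# (standard table value; the level itself is an assumption here).
T_CRITICAL_DF17 = 2.110


def t_for_r(r, n):
    """t statistic for testing whether a correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def significant_at_05(r, n=19):
    """True if r is significant at the assumed two-tailed .05 level for N = 19."""
    if n != 19:
        raise ValueError("the critical value is hard-coded for N = 19")
    return abs(t_for_r(r, n)) > T_CRITICAL_DF17


if __name__ == "__main__":
    for label, r in (("A-B", 0.26), ("A-C", 0.58), ("B-C", 0.36)):
        print(label, round(t_for_r(r, 19), 2), significant_at_05(r))
```

Under these assumptions only the A-C correlation of .58 exceeds the critical value, in line with the text.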

Validity of Agreement Through Group Comparisons 

Differences in performances were examined for minority-nonminority, women-men, 
high school-middle school, and urban/inner city-suburban teachers and for the number of 
courses teachers had completed in the subject area. It was felt that this could provide at 
least preliminary glimpses of the assessment's difficulty for different groups. Some of these 
analyses that compare different groups have been discussed in earlier sections. 
The pilot test sample size and design were not constructed to provide stable estimates of 
differences among these groups. For example, some 
subgroups have as few as three teachers in them. Nevertheless, an examination of 
differences among groups provides some initial insights into the validity of this assessment. 
Table 7.7 contains a summary of the trends for the pilot sample of 19 teacher candidates. 
Appendix F provides the means, standard deviations and numbers of candidates from which 
these summaries were constructed. A plus (+) simply indicates that the mean or average 
for the first group was greater than that for the second group. For example, the pluses under 
the Female-Male column indicate that for 9 of 10 variables the females' average was greater 
than the males'. 
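
The plus and minus entries described above record only the direction of each mean difference. A minimal sketch of how such a column and its summary fraction could be produced from group means; the function name, category labels, and means are hypothetical and are not the Appendix F values.

```python
def trend_column(first_group_means, second_group_means):
    """Sign of (first group mean - second group mean) for each category.

    Returns the per-category signs and a summary such as "2/3" giving the
    number of categories on which the first group's mean was greater.
    """
    signs = {}
    for category, first in first_group_means.items():
        second = second_group_means[category]
        if first > second:
            signs[category] = "+"
        elif first < second:
            signs[category] = "-"
        else:
            signs[category] = "0"
    plus_count = sum(1 for s in signs.values() if s == "+")
    return signs, f"{plus_count}/{len(signs)}"


if __name__ == "__main__":
    # Hypothetical group means on a four-point scale, for illustration only.
    group_one = {"Content": 3.1, "Plan": 3.0, "Delivery": 2.8}
    group_two = {"Content": 2.9, "Plan": 2.7, "Delivery": 3.0}
    print(trend_column(group_one, group_two))
```

Because only the sign is kept, a tally of this kind says nothing about the size or statistical significance of any difference.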






TABLE 7.7 

TRENDS OF MEAN DIFFERENCES BETWEEN CANDIDATES WITH DIFFERENT 
CHARACTERISTICS FOR ACTIVITIES AND EVALUATION CATEGORIES* 

Columns: Gender (Female-Male); Preparation (More-Less); Level of Teaching; 
Teaching Location (Suburban-Urban/Inner City); Ethnicity (Non-Minority-Minority) 

Rows: 
Activity A, Responding to Student Writing: Response Strategies; 
  Analysis of Writer & Text; Overall Activity A 
Activity B, "Fishbowl" Discussion of Literary Work: Interpretative Process; 
  Group Process; Overall Activity B 
Activity C, Speaking of Language: Content; Plan; Delivery; Overall Activity C 

[The plus and minus entries for the individual evaluation categories are not 
legibly aligned with their columns.] 

                Gender    Preparation    Level of    Teaching    Ethnicity 
                                         Teaching    Location 
SUMMARY         10/10        8/9           9/10       10/10        6/10 

*Entries reflect the direction of the mean differences for the different candidates. For 
example, in the activity and the evaluation category Responding to Student Writing, 
Response Strategies, the average or mean of female teachers in the pilot test was greater 
than that of the males. These do not generally represent statistically significant differences and 
due to small N's no tests of significance were calculated. 






Trends in the table provide some preliminary evidence that on this assessment and 
pilot test sample: 

• females' averages were greater than males' on 9/10 evaluation categories; 

• teachers with more preparation/courses had averages that were greater than those 
with less (i.e., 0 or 1 course) on 9/10 scores; 

• high school teachers' averages were greater than junior/middle school teachers' on 
10/10 evaluation categories; 

• teachers in suburban settings had averages that were greater than those in 
urban/inner city settings on 10/10 evaluation categories; and 

• nonminority teachers' averages were greater than minority teachers' on 6/10 
evaluation categories. 

If these trends were to hold for larger, more representative samples, some of these 
trends would be encouraging evidence for the "validity" of the assessment, whereas others 
would provide less encouraging findings. For example, if teachers with more preparation 
courses perform better than those with fewer, this would provide some positive evidence that 
the assessment does differentiate among those with greater and less knowledge/preparation. 
Similarly, it would be desirable for the assessment to minimize any adverse impact on 
minority teacher candidates. Thus, although the difference between minority and 
nonminority teachers was smaller than others (e.g., preparation or teaching location), a sample 
of three minority teachers prohibits drawing any conclusions about how minority teachers 
would perform on this assessment. 

The above comments address ways in which group performances can contribute to 
evaluating the appropriateness and difficulty of the assessment. However, the results also 
can provide information that will be useful for teacher preparation, training and 
recruitment. For example, it is not encouraging that urban and inner city teachers score 
lower than others. This finding provides further support that the urban and inner city 
schools may very well not be attracting or securing as strong a new teacher force as 
suburban schools. The trend for middle and junior high school teachers to perform less well 
than high school teachers could imply a need for strengthening the content specific 
preparation of these teachers. 






Content validity. The content validity of this assessment rests largely on the role 
that teachers and English educators had in its development, and on the analyses of the match 
of the assessment to the model curriculum and teaching standards, which supply evidence 
of the assessment's content validity with respect to current and emerging content. These have been 
described earlier, and implications for further development are described in the following 
section. 

Conclusions and Recommendations 

This section contains conclusions and recommendations regarding the three 
assessment center activities of the Secondary English Assessment. The section presents 
information in the areas of administration, content, and format, and concludes with a brief 
summary. 

Administration of Assessment 

The Secondary English Assessment, as discussed in this report, consists of three 
distinct activities, each of which is approximately 50 minutes long. As administered in this 
pilot test, the first activity was scored in the afternoon, after the teachers had completed all 
of the assessment activities. The second and third activities were scored during and 
immediately after they were administered. Thus, for this pilot test, the administration and 
scoring of the Secondary English Assessment required approximately five hours a day per 
assessment. 

Based on our experience, the following factors seem to be key to smooth 
implementation of the Secondary English Assessment (or any assessment that includes 
similar assessment center activities): 

• recruitment of assessors who are experienced English teachers, who have had 
experience in formal writing assessment programs involving holistic scoring of 
writing samples, and who are knowledgeable about different teaching 
styles, interactive styles, and patterns of communication; 

• availability of appropriate assessment center facilities (e.g., two rooms 
for assessment); 

• development of clear orientation materials for teachers which include 
descriptions of each assessment activity, the criteria by which the teachers will 
be evaluated, and all preparation materials needed to be read/completed 
before assessment day; and 

• development of procedures to collect and store assessment materials for each 
activity such as assessment booklets, evaluation response forms, and completed 
preparation materials (e.g., reading logs). 

Another key factor to the smooth implementation of assessment center activities 
such as those of the Secondary English Assessment is assessor training. Good assessor 
training serves to familiarize the assessor candidates with the content of the activities, as 
well as how to administer and score the activities. Although all of the assessors described 
the training for this pilot test as "very good," they also made some suggestions for 
improvement. Based on the assessors' comments, our observation of the training, and on 
performance data from the assessment activities, we believe the training could be improved 
by following these recommendations: 

• Development of an assessor handbook to serve as a guide for assessors 
when administering and scoring the activities. In particular, the 
handbook should include a complete description of the scoring process 
and specifically provide concrete examples whenever possible of (1) the 
distinctions between rating points, and (2) the way in which comments should 
be written on the scoring response sheets. 

• Extension of the training by one half to one full day in order to address more 
thoroughly the material covered in the assessor handbook (e.g., the details 
of the scoring system). 

Following the above suggestions should greatly facilitate the administration of the 
assessment center activities. 

Assessment Content 

Based on the observations of FWL staff, as well as information from assessors, 
teachers, our consultant on cultural diversity, and the assessment documentation (i.e., the 
scoring response forms for each activity), the following conclusions are offered about the 
content of the Secondary English Assessment's three assessment center activities. 






• Congruence of the three assessment center activities with the 
English-Language Arts Framework is weak. Activity B, "Fishbowl" Discussion of 
Literary Work and Activity C, Speaking of Language would especially 
have to be revised in order to achieve strong congruence. In particular, 
the two activities would need to be changed so that there is a greater 
focus on a teacher's skill in responding to or developing students' 
abilities versus a demonstration of skill in activities that have only an 
indirect relationship to teaching students. 

• Coverage by the Secondary English Assessment of the California 
Standards for Beginning Teachers is also weak. Not one of the 
standards is fully covered by any of the activities, and not one standard 
is directly addressed by Activity B, "Fishbowl" Discussion of Literary 
Work or Activity C, Speaking of Language. Although most of the 
standards are addressed by Activity A, Responding to Student Writing, 
they are usually addressed in an indirect or limited way. 

• Based on the teachers' and assessors' comments, the three assessment 
center activities seem to be job-related, although Activity C, Speaking of 
Language seems less so, and Activity A, Responding to Student Writing 
more so. In addition, Activity B, "Fishbowl" Discussion of Literary 
Work and Activity C are probably best described as indirectly job-related 
because they do not directly assess in any way a teacher's teaching 
skills. 

• Of the three activities, the teachers had the least difficulty with Activity 
B, "Fishbowl" Discussion of Literary Work, with almost 90% of the 
teachers passing. Activity A, Responding to Student Writing and 
Activity C, Speaking of Language were passed by 63% and 68% of the 
teachers respectively, suggesting that beginning teachers may have had 
less opportunity to acquire the skills and knowledge measured by those 
activities than Activity B. 

• Teachers and assessors thought the assessment is fair to teachers of 
different grade levels. However, analysis of the performance data seems 
to suggest that junior high/middle school teachers may be less well 
prepared than senior high teachers for Activity A, Responding to 
Student Writing and Activity C, Speaking of Language. 




• When asked to comment on how well the assessment activities address 
a beginning teacher's ability to work with diverse students, the 
assessors commented favorably about Activity A, Responding to Student 
Writing but found Activity B, "Fishbowl" Discussion of Literary Work 
and Activity C, Speaking of Language to be less suitable. Our 
consultant on cultural diversity pointed out, however, that assessors 
need to be familiar with the current research on students of 
different racial/ethnic groups as it pertains to the topic of the 
assessment activity (e.g., student writing) for Activity A to be a fair 
assessment. 

• With regard to the question of the assessment's fairness across groups 
of teachers (e.g., different ethnic groups, different language groups), the 
majority of teachers and assessors responded positively. The issue of 
fairness is also largely dependent upon the assessors' knowledge of the 
possible teaching practices and styles of different groups of teachers. 

• The performance data indicate that, with regard to different groups of 
teachers, females tended to receive higher overall ratings than males for 
Activity A, Responding to Student Writing and Activity B, "Fishbowl" 
Discussion of Literary Work, and lower for Activity C, Speaking of 
Language. Also, those teachers who described themselves as teaching in 
suburban locations tended to receive higher ratings in all of the 
evaluation categories for all three activities. The performance ratings of 
our small sample of minority teachers (3) were mixed (i.e., some higher 
and some lower) in comparison to those of non-minority teachers. 

• While a majority of teachers and all of the assessors think the three 
assessment center activities are an appropriate way of assessing skills in 
teaching English classes, many of the teachers and some of the 
assessors reject the notion that these activities are an appropriate 
way of assessing general teaching skills. In particular, the teachers 
faulted the assessment center activities for not requiring any teaching. 

Assessment Format 

Although the format of each of the three assessment center activities is distinctly 
different, the format of each has these two features in common: (1) preparation work to be 
completed before the assessment day, and (2) a performance-based activity administered at 
an assessment center. Based on comments by teachers, assessors, and FWL staff, the 
following conclusions and recommendations are offered regarding the format of the 
preparation work and of the performance-based activities: 

• Although the majority of teachers read the Orientation Handbook 
carefully and were satisfied with the materials presented, almost one 
third of the teachers said they had difficulty with the preparation work 
required for Activity C, Speaking of Language. The problem cited most 
often was that all or some of the research articles for the activity were 
difficult to understand or not interesting, a difficulty which may
indicate a lack of experience in reading research articles.

• The time allotted for each activity (approximately 50 minutes) was 
deemed sufficient by the majority of teachers. Consideration could be 
given, however, to extending the time allotted for part two of Activity A, 
Responding to Student Writing, as this activity was not finished by
approximately 25% of the teachers. 

• Based on assessor comments, the teachers' performance data, and an 
examination of the completed rating response forms, revisions need to
be made to the rating process and forms for all three of the assessment
center activities. For each of the following activities, we recommend the 
following: 

Activity A, Responding to Student Writing 

- Revise the scoring process and response form so that the 
teachers' answers to the "Purpose in Responding" part of 
the activity are taken into account on the scoring 
response form. 

- Revise the scoring response form so that there is greater 
symmetry between the list of purposes and the evaluation 
criteria on the response form. 

- Review the scored response forms from this year's pilot
test to address the question of whether the teachers who
mark some purposes over others tend to get higher
ratings.

Activity B, "Fishbowl" Discussion of Literary Work

- Address the question of whether the scoring process, as it
is presently constructed, allows the assessors to
consciously or unconsciously favor a particular style of
interpretation or way of working in a group.

- Consider revising the rating process for the activity so
that an additional form is used which allows assessors to
write their notes and comments on one side of the form
and, on the other, commentary that explains
the relationship of their notes to the ratings.

- Continue to use separate assessors for this activity and
Activity C, Speaking of Language, so that a teacher's
performance in one activity does not influence his/her
ratings in the other activity.

Activity C, Speaking of Language

- In any future training, address the question of how to
score a teacher who gives a good presentation but does
not really address the issue presented in the reading
material.

- Consider adding articles and corresponding
questions so as to expand the set of questions used
for the oral presentations, making it possible for
each teacher to draw two questions (instead of one), and
then to select one and discard the other.

• For all activities, consider having assessors give only the holistic rating
for the evaluation categories and the overall activity, using the
evaluation criteria listed under each category as prompts and guides for
summarizing and making comments to support the ratings and
highlight candidates' strengths and weaknesses.



Summary 

While all three assessment center activities of the Secondary English Assessment are 
innovative and strongly performance-based, only Activity A, Responding to Student Writing,
assesses skill and knowledge that is directly related to the teaching of students in an
English class. As they are now constructed, the content and format of the other two
activities, Activity B, "Fishbowl" Discussion of Literary Work, and Activity C, Speaking of
Language, seem better suited to staff development purposes, although revisions could 
possibly be made to the activities so that they more directly relate to the teaching of 
students. 






CHAPTER 8:
CONCLUSIONS



This final chapter contains our conclusions from our pilot testing experience during 
the second year of the California New Teacher Project. The first section describes each 
assessment approach that was pilot tested during Spring and Summer 1990, and identifies 
strengths and weaknesses. Next, cost estimates are discussed and conclusions about 
characteristics of successful training of assessors and scorers are stated. The chapter 
concludes with a discussion of policy issues that have been identified during this round of 
pilot testing. 

Assessment Approaches 

Although the purpose of the pilot tests was to use the specific instruments to learn 
about the potential of assessment approaches, the preceding chapters focused on the 
individual instruments. This section describes the assessment approach for which each 
instrument serves as an exemplar, and summarizes our conclusions about the critical 
features, as well as the strengths and weaknesses of each assessment approach. These
conclusions are tentative for some assessment approaches, as few of the instruments piloted 
represent an assessment approach with a lengthy history with respect to teaching. 

Each instrument reflects one of four assessment approaches: structured simulation 
tasks, classroom observation, videotaped teaching episodes, and a set of performance-based
assessment center exercises. 



Structured Simulation Tasks 



Definition. This assessment approach, administered in a large group setting, 
requires a teacher to perform a task which simulates work characteristic of one or more 
teaching responsibilities. The teacher's response is then compared to a list of previously 
identified responses or response characteristics. 

Characteristics of instruments piloted. Two instruments representing a Structured 
Simulation Task approach were piloted during 1990. One, the Secondary Life/General
Science Teacher Assessment, is modeled after an assessment created by the Rand
Corporation for use in the licensing of lawyers. The other, the Assessment of Competence
in Monitoring Student Achievement in the Classroom, is based on a decade of experience in 
staff development for teachers in the area of student achievement. 

Both instruments ask the teacher to analyze a completed teaching task, to
outline how they would perform a task, or to actually do a brief task. The exercises which
form the Assessment of Competence in Monitoring Student Achievement in the Classroom 
focus on either small pieces of a larger task (e.g., construct three items for a multiple-choice 
test) or on general outlines of how larger tasks (e.g., construction of an observation 
assessment to determine readiness for promotion to first grade) could be accomplished. 
Answers are relatively brief, typically a sentence or two or a list of up to four
items. The tasks in the Secondary Life/General Science Teacher Assessment focus on large
tasks such as critiquing a lesson or planning a two-week unit. Each task is divided into 
subparts. Background material which describes the context is provided for each task in the 
Secondary Life/General Science Teacher Assessment. This requires the teacher to take 
more factors into account in developing a response, which typically is a list that may contain 
up to twenty items. 

The scoring systems for the two instruments differ somewhat. In the pilot test, 
scorers for the Secondary Life/General Science Teacher Assessment were allowed to use 
their discretion in allowing credit for answers not on the scoring key, while scorers for the 
Assessment of Competence in Monitoring Student Achievement in the Classroom were 
asked to only note, but not credit, possible additions to the scoring key. Although each 
assessment contained pieces that were similarly scored, the nature of the majority of the 
scoring criteria for the two assessments differed. The scoring criteria for the Secondary 
Life/General Science Teacher Assessment generally focussed on characteristics of a product that the
teacher produced which were entirely dependent on the task and the content. In contrast, 
the scoring criteria for the Assessment of Competence in Monitoring Student Achievement 
in the Classroom most often focussed on teacher criticism of general assessment practices or 
on general principles of construction of particular types of student assessments which were 
content-free. 
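
The difference between the two scoring policies described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the pilot tests: the function `score_response`, the `ScoreResult` record, and the sample key and answers are all hypothetical, and the exact text matching stands in for the professional judgment a trained scorer would apply to differently phrased responses.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreResult:
    credited: list = field(default_factory=list)  # answers that earned credit
    noted: list = field(default_factory=list)     # off-key answers flagged as possible key additions


def normalize(text):
    """Lowercase and collapse whitespace so trivially different phrasings compare equal."""
    return " ".join(text.lower().split())


def score_response(answers, scoring_key, scorer_accepts=None, allow_discretion=False):
    """Score a list of answers against a scoring key.

    scorer_accepts lists off-key answers the scorer judges acceptable (hypothetical input).
    With allow_discretion=True those answers are credited; otherwise they are
    only noted as candidate additions to the key and earn no credit.
    """
    key = {normalize(k) for k in scoring_key}
    accepted = {normalize(a) for a in (scorer_accepts or [])}
    result = ScoreResult()
    for answer in answers:
        item = normalize(answer)
        if item in key:
            result.credited.append(answer)
        elif item in accepted:
            if allow_discretion:
                result.credited.append(answer)
            else:
                result.noted.append(answer)
    return result


# Hypothetical key and response, invented for illustration only.
key = ["wear safety goggles", "tie back long hair"]
answers = ["Wear safety goggles", "Label all chemical containers", "Let students work alone"]
judged_acceptable = ["label all chemical containers"]

discretion = score_response(answers, key, judged_acceptable, allow_discretion=True)
note_only = score_response(answers, key, judged_acceptable, allow_discretion=False)
```

Under the discretionary policy the off-key answer earns credit; under the note-only policy the same answer leaves the score unchanged and is recorded for a later revision of the key.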

Strengths and weaknesses. The major strengths of the Structured Simulation Tasks
approach to teacher assessment are: (1) ease of administration and scoring; (2) job 
relevance through a focus on application, especially in the area of content pedagogy; and (3) 
the ability to assess teaching of diverse students through use of descriptions of specific types 
of students in the stimulus materials. Structured simulation tasks can be easily 
administered on a large scale, and do not require administrators with content expertise. 




Job relevance varied, but was strong for those tasks or exercises in the pilot tests which 
asked the teachers to produce or analyze some sort of product used in instruction (e.g., 
lesson description, multiple choice items), analyze a simulated classroom transcript 
illustrating effective and ineffective instructional techniques, or perform a task related to 
laboratory safety. These tasks give a clear idea of whether or not the teacher can produce 
acceptable work in the context described or whether the teacher can analyze teaching 
products or interactions of other teachers. Since beginning teachers typically create only a 
small portion of their instructional materials, the ability to analyze activities and materials
is important. Specifying the type of student and/or instructional goals in the stimulus 
materials is also important for the proper design of instruction and for ascertaining a 
teacher's ability to design instruction for various types of students. 

The major weaknesses of this approach are (1) the paucity of diagnostic information 
generated by the scoring system; (2) the ability to reflect only a few specified teaching
contexts, techniques, and topics in the stimulus materials without vastly increasing the time 
for administration; and (3) the difficulty or possible inability to measure many teaching 
competencies involving either nonverbal behavior (e.g., some aspects of classroom 
management, establishment of rapport) or the classroom as a whole (e.g., efficient 
management of routine activities such as collecting homework). 

While the scoring system indicates whether or not a teacher can perform the task in 
the context described with the teaching techniques described, it cannot provide diagnostic 
information as to the source of the teacher's difficulties, i.e., did the teacher fail because of a
lack of knowledge of the content pedagogy, knowledge of the specific group of students in 
the stimulus materials, or a lack of experience with the specific techniques cited? Thus 
while this approach may be suited to licensure decisions, it would be unlikely to provide 
sufficient information to guide choice of staff development activities. 

This methodology also assumes that either beginning teachers have the ability to 
analyze situations which do not resemble their own or that enough situations are 
represented that the teacher is not unduly penalized due to their lack of breadth of
experience. It is not clear whether or not this assumption is warranted. Tailoring 
responses to the type of students, teaching techniques, and/or topic specified in the tasks 
was difficult for teachers, based on their responses and feedback evaluations. It is possible 
that these difficulties might be eliminated through improved and expanded instruction
during teacher preparation. On the other hand, it is equally possible that the difficulties are 
characteristic of a beginning teacher who has limited experience, and that the ability to 
generalize to different students, different teaching techniques, and different topics does not
fully develop until a later date. Choosing between these two alternative explanations is
difficult until more information is derived from the current research on teacher preparation
and differences between beginning and experienced teachers. 

Some teaching skills are difficult to simulate, e.g., the establishment and 
maintenance of rapport between teacher and students, efficient management of routines,
and thus these skills are probably better assessed with other assessment approaches. 

Other possible weaknesses of the approach are in the areas of fairness across groups 
of teachers and the appropriateness across different teaching contexts. Given the limited 
diversity in our sample of teachers and the lack of alternative measures of their teaching 
skills, this issue needs further exploration before drawing any conclusions. However, our 
consultant on diversity warned that one of the sample tasks she examined had the potential 
to penalize specific groups of teachers for teaching behaviors which were effective in their 
specific context. (The specific example cited, with a citation of research attesting to its 
effectiveness, was the use of sarcasm by black teachers to motivate black inner city 
students.) She expressed concern that an approach that only focuses on one method of 
teaching or on one context might not be appropriate for these teachers who are effective in 
their specific context and who evaluate teaching techniques in light of their experience in 
that context. 

Classroom Observations (Subject-Matter Focus)

Definition. A classroom observation approach to teacher assessment consists of 
observing teachers as they instruct students in their classrooms. This approach was 
reviewed in last year's report (Estes et al., 1990). A classroom observation with a subject- 
matter focus includes specific categories which examine the effectiveness of instruction in a 
particular subject. 

Characteristics of instrument piloted. The Science Laboratory Assessment 
instrument piloted is a closed system, high-inference instrument. It requires observers to 
use their professional expertise to make judgments about specific categories of teacher 
behaviors. The innovative aspect of the Science Laboratory Assessment is the inclusion of 
several categories specifically designed to assess the subject-matter pedagogy and safety 
skills of laboratory science teachers at both the elementary and secondary levels. The 
scoring system by which this subject-specific evidence was evaluated, however, only reached 
a preliminary stage of development, and needs much further refinement before the 
observation instrument could be successfully implemented. 




Strengths and weaknesses. The major strength of all classroom observations is job
relevance. Classroom observations assess teachers in the process of doing their work, so 
they have high job relevance and face validity. When teachers name a method of preferred 
assessment, they usually name classroom observations. In order to achieve this strength, 
however, classroom observations, whether subject-specific or not, need clearly established 
foci for observations, and criteria for assessing the adequacy of the teacher performance 
observed. In addition, the observers must be trained to recognize similar phenomena as 
they occur in quite different contexts. 

A strength of classroom observations with a subject matter focus is that such 
observations allow assessment of some subject-specific teaching competencies which are 
difficult to assess except through direct observation, such as the maintenance of a safe 
environment for laboratory science teachers, and initiating and managing discussions among 
students in different subject areas. 

The weaknesses inherent in all classroom observations are (1) the lack of 
generalizability across teaching contexts and topics, and (2) the complexity of administration 
and observer training. Classroom observations have limited ability to sample, as 
observations are limited to a specific classroom of students, a specific lesson, and a specific 
time of year. Stable estimates of teaching competencies depend on multiple observations, 
and are not generalizable across lesson types, subjects and grade levels (Stodolsky, 1988, p. 
12). 

Although all classroom observations involve some administrative complexity, 
observations with a subject matter focus increase the complexity. For a valid assessment, a 
careful match must be made between observers and teachers with respect to grade level 
(e.g., elementary, secondary), subject matter (e.g., life science, physical science), and 
availability. Enabling observers to make comparable judgments while watching instruction 
in differing teaching contexts and on different topics requires complex and lengthy training 
if high inference observation instruments such as the two pilot tested are used. 

One area of difficulty experienced in the Spring pilot test may be a symptom of an 
additional limitation of high inference classroom observations in general, but particularly 
those observations like the Science Laboratory Assessment which attempt to assess both 
general pedagogical skills and subject matter skills. The Science Laboratory Assessment, a 
subject-specific instrument, requires the observers to rate more domains than the 
Connecticut Competency Instrument (CCI), an instrument which focuses on general
pedagogy. Observers using the former instrument seemed to have some difficulty not
experienced by observers using the latter instrument in gathering sufficient evidence to
support judgments for each domain. Some of this difficulty could probably be addressed by 
further refining the training for gathering evidence and with additional experience in 
administering the instrument. However, it is possible that there are limits to the number of 
domains that can be assessed through a single observation using instruments like the two 
piloted, which are extremely high-inference in nature. 

Videotaped Teaching Episodes 

Definition. This approach to teacher assessment requires a teacher to respond to 
questions pertaining to videotaped scenarios of teachers instructing students in a variety of 
contexts. Some supplementary material (in this case, stories read by students) may be 
provided. 

Characteristics of instrument piloted. One instrument representing the Videotaped
Teaching Episodes Approach, the Language Arts Pedagogical Knowledge Assessment 
(LAPKA), was pilot tested. This instrument centers around videotaped scenarios which 
vary in type of elementary language arts instruction, grade level, and group size. Each 
scenario is broken down into short segments. After previewing the questions for a 
particular segment, teachers view that segment once and respond to the questions with 
short-answer written responses. The questions ask the teachers to describe important 
features of the content pedagogy represented in the videotape, evaluate the effectiveness of 
these methods, and extend the principles inherent in the methods to suggest ways of 
improving or changing the methods shown. 

The structure of presentation and response used in LAPKA contrasts with other 
possible variations within this assessment approach. For example, the Stanford Teacher 
Assessment Project piloted four assessments in four different subjects (elementary literacy, 
elementary mathematics, secondary biology and secondary history) using a videotape 
stimulus to identify master teachers. In this variation on the Videotaped Teaching Episodes 
approach, teachers viewed a brief set of videotaped lesson segments, and then responded in 
a semi-structured interview to the material they viewed. 

LAPKA's use of a single monitor differed from yet another variant on the Videotaped
Teaching Episodes approach used by David Berliner (1989), where teachers ranging in 
experience viewed three monitors showing the same lesson from different camera angles and 
commented on what they saw. Berliner's experience suggests that the LAPKA approach 
was more suitable for beginning teachers, as his beginning teachers experienced difficulties
in observing the multiple monitors simultaneously, while the experienced teachers were able
to effectively use all three to interpret events. The LAPKA format of having teachers 
respond to specific questions which are previewed before seeing each videotaped segment is 
probably also helpful, as Berliner's beginning teachers had difficulties in focusing their 
attention during viewing of the videotape, particularly in distinguishing typical from 
atypical events and important from unimportant information. 

Strengths and weaknesses. The strengths and weaknesses of the Videotaped
Teaching Episodes approach are more difficult to identify than those of other assessment 
approaches because the assessment approach is relatively new, and a scoring approach that 
fully capitalizes on the use of the medium has yet to be developed. Our identification of 
strengths and weaknesses is, therefore, tentative. 

The strengths of the Videotaped Teaching Episodes approach are: (1) job relevance 
through the actual portrayal of teachers in action; and (2) the ability to assess specific 
teaching knowledge which is difficult to assess using other assessment approaches. 
Teachers are asked to describe and/or evaluate a series of videotaped segments showing 
teachers instructing their students, as opposed to a stimulus of written outlines of lessons or
simulated transcripts. The videotape stimulus is especially good for evaluating some aspects 
of knowledge of teaching, e.g., the ability to know when a student's nonverbal responses 
indicate that s/he is becoming too frustrated, and the ability to analyze teacher-student 
interactions. While assessing these abilities is a particular strength of Videotaped Teaching 
Episodes, it is important to remember that there is much information possessed by the
teacher in the videotape (i.e., knowledge of individual students and school context) that 
cannot be fully communicated to the teacher being assessed. 

The major weaknesses of the Videotaped Teaching Episodes are (1) the expense and
complexity of development associated with videotaping lessons; (2) the difficulty in assessing
certain teaching competencies; and (3) a dependence on technology for administration.
Development of suitable videotapes is a complex process and can be very expensive. The 
film and videotaping equipment and the extensive editing required to produce high quality 
videotapes contribute to the expense. The production of suitable lesson segments is a 
complicated process. Scripting can produce artificiality, while naturalistic videotaping may 
not produce results which lend themselves to assessment. 






Due to technical difficulties in sound and field of vision, the videotaping is best 
suited for small groups of students, and not entire classrooms. Therefore, such 
competencies involving the classroom as a whole, such as maintenance of behavior standards 
and keeping students engaged, are difficult to evaluate except with multiple monitors, which 
Berliner (1989) finds are unsuitable for beginning teachers. 

The administration of the assessment is moderately complex, due to the reliance on 
technology. The assessment rooms must be set up to ensure that each teacher being 
assessed has an equally clear view of the monitor, and that the sound is audible to all 
teachers. In addition, equipment failure, though likely to be rare, has severe consequences 
for the assessment, either extending or canceling the assessment administration. 
Equipment failure can be minimized through pretesting equipment and making 
arrangements for backup equipment and a technician. 

Performance-Based Assessment Center Exercises 

Definition. Performance-Based Assessment Center Exercises have two main
characteristics: (1) they bring teachers together at a central place to participate in a series 
of activities, each of which uses a different methodology to measure a distinct teaching skill; 
and (2) the activities require the teachers to directly demonstrate some skill which can be 
assessed by evaluating either the performance or the product produced, depending on the 
focus of the activity. 

Characteristics of instrument piloted. One instrument representing the 
Performance-Based Assessment Center Exercises approach was pilot tested: the Secondary 
English Assessment. This instrument consists of three activities. The first activity asks 
teachers to respond to two samples of student writing to demonstrate their ability to 
analyze student writing and to communicate their analysis of the writing both to the 
student and to peers. The second activity uses a small group discussion format to measure
a teacher's ability to analyze a text and participate in a group discussion. The third activity 
asks teachers to deliver an extemporaneous speech on a given issue pertaining to the use of 
language in the classroom to measure a teacher's speaking ability with respect to important 
issues in English instruction. Each activity was scored using a holistic scoring process. 

Strengths and weaknesses. The major strengths of the Performance-Based 
Assessment Center Exercises are (1) job relevance through a direct focus on specific 
teaching abilities; (2) the in-depth measurement of a small number of distinct teaching
abilities; (3) the possibility of multiple measures of a single teaching competency using
different methodologies; and (4) the ability to assess the teaching of specific groups of
students specified in the stimulus materials. 

Although job relevance is a major strength of Performance-Based Assessment Center 
Exercises, the realization of this strength depends on the ability to simulate the skills being 
measured outside the classroom. For instance, in the instrument pilot tested, the Secondary 
English Assessment, the activity asking a teacher to respond to student writing samples is 
very similar to what a teacher does in the classroom. In contrast, the activity where
teachers discuss a piece of literature measures a teacher's abilities to interpret a text and to 
participate in a group discussion about the text. While these abilities are important skills 
that English teachers should have, a more relevant job skill would be the ability to teach 
students to interpret and discuss literature. 

Performance-Based Assessment Center Exercises focus on only a few specific 
teaching abilities, but measure multiple aspects of these abilities. This approach thus has 
the potential to provide rich diagnostic information for the few abilities assessed. In 
addition to assessing abilities, it has the potential to measure other more general domains of 
knowledge of teaching, such as knowledge of students, through several different 
methodologies, decreasing the chances of mismeasurement due to deficiencies in skills 
relating more to a specific methodology (e.g., verbal fluency with adults, in the case of Semi- 
Structured Interviews) rather than to the knowledge being assessed. 

Like Structured Simulation Tasks, the Performance-Based Assessment Center 
Exercises assessment approach facilitates the assessment of the ability to teach specific 
types of students described or reflected in the stimulus materials. For instance, all of the
stimulus materials for the Secondary English Assessment (i.e., the samples of student
writing, the literature discussed, and the topic of the extemporaneous speech) reflect the
teaching of students from diverse cultural and linguistic groups which is typical of the 
majority of California classrooms. 

The weaknesses of the Performance-Based Assessment Center Exercises approach lie
in three areas: (1) the inability to assess teaching competencies which are difficult to
simulate (e.g., a teacher's rapport with students or the establishment of classroom
routines); (2) the small number of teachers that is assessed per assessor, when simulations
using small groups or individual presentations are used; and (3) the complexity of 
scheduling candidates when only a limited number can be assessed through one or more of 
the activities in the set of exercises. 






Guidelines for the Design of Training



During the first two years of pilot testing, we have observed a variety of training 
sessions for assessors, observers, and scorers. In reflecting over the strengths and 
weaknesses observed in all these sessions, we have identified some guidelines for the design 
of effective training. These guidelines will not seem profound, especially to educators, since 
they are simply principles of good instruction. They may even seem trivial or obvious, but 
many of the problems we have observed in training can be traced to the failure to follow one 
or more of these guidelines. 

First, there should be clearly specified performance criteria for trainees in any role in 
assessment administration or scoring, in terms of what the trainees should be expected to be 
able to do upon completion of the training. Examples of such criteria are:

• For assessor training, the ability to identify instances where teacher responses to
questions require a followup question for clarity;

• For observer training, recognizing the same phenomenon in different teaching
contexts or lesson types; and

• For scorer training, the ability to accurately match teacher responses to
predetermined correct responses, even though the teacher responses might be
phrased in completely different ways.

As with scoring criteria, these performance criteria may be more general in the early 
stages of an assessment instrument's development, but the lack of specific criteria generally 
means that the assessment is not clearly conceived. An erroneous or even too strict 
specification of criteria is preferable to criteria which are too general. Inappropriate criteria
become obvious when they are applied, and the nature of the dissatisfaction frequently
points to more appropriate criteria; however, this guidance is not available when criteria are 
vague. Typically, this process of applying and then revising criteria seems to take at least 
two or three iterations. 

The training itself should be focused on instructing observers, assessors, or scorers to 
develop the skills needed to meet these performance criteria. In order for it to do so 
effectively, the following are necessary: 






• Clear definition of all terms and criteria, with specific examples, to clearly 
communicate expectations and standards. 

• Sample responses spanning the range of likely occurrences to illustrate judgments 
to be made, such as conditions requiring intervention by an assessor or the 
application of scoring criteria. 

• Opportunities for the trainees to voice their opinions about good teaching with 
respect to the judgments they will be making so that the trainers can compare 
these perceptions against assumptions built into the assessment instrument. 
Sometimes, the trainees need to be guided to broaden their conception of good 
teaching or to redefine their standards to allow for fair application across different 
teaching contexts. 

• Provision for independent practice in applying the skills being learned, together 
with the monitoring of individual performances and subsequent adjusting of 
instruction, if necessary. 

• Provision for individual formative feedback to each trainee on their performance 
and a summative assessment at the end. 

These guidelines are based on strengths and weaknesses observed in training 
sessions for the two years of pilot tests, and provide a framework for planning and
evaluating training for various tasks in administering and scoring assessments. 

Cost Estimates 

Most of these assessment approaches, particularly those yielding diagnostic 
information which might inform staff development choices, are considerably more expensive 
than multiple choice tests. Our estimates of the per teacher cost of administering and
scoring assessments such as those pilot tested range from $36 per teacher for the
Assessment of Competence in Monitoring Student Achievement in the Classroom to $134
per teacher for a single observation using the Science Laboratory Assessment. In general, if
the developmental work recommended is done, the less expensive assessments would be 
suitable for licensure decisions based on the teaching competencies measured by the 
assessment instrument, but produce limited diagnostic information to inform staff 
development choices. 






Policy Issues 

The ultimate goal of the California New Teacher Project is not the identification of 
better assessment instruments, but the improvement of teaching in schools. Thus, the 
major criterion by which the assessment approaches which have been pilot tested should be 
judged is their cost effectiveness at improving the instruction in the California public 
schools. Obviously, these assessment instruments cannot bear this burden alone. If an 
additional credentialing requirement involving one or more of these tests is implemented, it 
will need to be coordinated with other reform efforts, such as the implementation of the 
subject-specific Curriculum Guides and Frameworks and the California Standards for 
Beginning Teachers as well as any statewide programs of new teacher support. 

Evaluation of the cost-effectiveness of different assessment approaches in improving 
instruction, and the identification of ancillary reforms needed to ensure this effectiveness,
requires consideration of policy issues beyond the scope of this report, which focuses on the
strengths and weaknesses of individual instruments and assessment approaches in
measuring teaching competencies. The Assessment Component of the California New
Teacher Project: First Year Report contained recommendations for policy decisions needed 
to guide the choice of assessment instruments as an additional credentialing requirement for 
new teachers. Our experience with the pilot tests described in this report leads us to 
propose the following additions and/or revisions to that list: 

• Assessment focus. In the first year report, we identified this area as a major 
decision to be made in the design of an assessment system. In this report, we 
wish to augment that recommendation. We have pilot tested five assessments in 
the second year of the project in addition to the four pilot tested during the first 
year, and have yet to identify any one assessment approach which does not exhibit 
a weakness in at least one important area of teacher competence. Either multiple
assessment approaches are needed to assess a wide range of competencies, or the
state will need to identify one or two areas of major interest (e.g., content
pedagogy, classroom management).

• Appropriateness for Beginning Teachers. When teacher performance across
multiple assessments is examined, some common weaknesses appear in the 
following areas: sequencing instruction, choosing appropriate representations of 
content, the breadth and depth of content knowledge, and designing instruction 
for different types of students. Given that a literature on the development of 
teaching skills is just beginning to emerge, we do not know if these weaknesses 




are developmental weaknesses typical of beginning teachers or are results of 
weaknesses in the current curriculum for the preparation of teachers. How are 
new teachers to develop these skills? If teacher preparation programs need to
change, what is the incentive, and how can the programs be assisted in their 
efforts to change, as well as monitored to ensure that change occurs? What would 
be done with teachers whose preparation occurred in another state? If it is 
expected that these skills develop on the job with additional experience, what 
policies and/or programs could facilitate their development? Are more complex 
assessments of new teachers worth implementing if these skills are ignored? 

• Coordination with professional development. Last year, we noted that one
decision to be made with respect to the design of an assessment system was the 
extent to which credentialing assessments should be coordinated with staff 
development activities. We can now better describe the impact of that decision on 
the choice of assessments. Based on our comparison of assessment approaches, 
we note that some approaches, such as Classroom Observations, would be well 
suited for providing information to guide staff development, while others, such as 
Structured Simulation Tasks, would not. 

• Teaching of diverse students. A teaching credential licenses a teacher to teach in 
any classroom in the state. However, beginning teachers possess in-depth 
knowledge of only a limited range of students, chiefly those experienced in student 
teaching and the first year(s) of teaching. Assessments which tap a depth of 
knowledge of teaching probably need to focus on the students with whom the 
teacher is familiar. Caution needs to be taken with assessments which focus on 
teaching specific types of students, as these might differentially advantage
teachers according to the degree of experience with each type of student, making 
fairness difficult to maintain. In addition, thought needs to be given to the 
identification and retention of teachers who excel in contexts where student 
achievement is typically low, even if they do not possess the breadth of knowledge 
of students which would enable them to teach effectively in other teaching 
contexts. However, licensing teachers to teach in specific contexts as well as in 
specific content areas would be a logistical nightmare. The degree to which the 
limited experience of beginning teachers is balanced against an interest in 
assessing a teacher's ability to teach in multiple teaching contexts is an important 
policy issue. 






The policy decisions outlined in this report and the previous report will affect the 
design of any system for assessing new teachers, which will be contained in a report to the 
State Legislature in 1992. The present and future reports analyzing the pilot testing 
conducted in the three years of the Assessment Component of the California New Teacher 
Project provide information on the strengths and weaknesses of a number of different 
approaches to teacher assessment as well as specific instruments representing these 
approaches. 






BIBLIOGRAPHY 







Berliner, BethAnn, Mate, Susana, Zalles, Dan, and Little, Judith Warren. (1987). Improving 
student teaching through clinical supervision. Volume two: Supervision and support 
through the eyes of student teachers and first year teachers. San Francisco, CA: 
Far West Laboratory for Educational Research and Development. 

Borko, Hilda, Lalik, Rosary, Livingston, Carol, Pecic, Kathleen, and Perry, Diana (1986). 
Learning to teach in the induction year: Two case studies. Paper presented at the 
annual meeting of the American Educational Research Association. 

Boyer, Ernest L. (1983). High school: A report to the Carnegie Foundation for the 
advancement of teaching. New York: Harper & Row. 

California Commission on the Teaching Profession. (1985). Who will teach our children? 
A strategy for improving California schools. Sacramento: California Commission on 
the Teaching Profession. 

California State Department of Education. (1985). Mathematics framework for California 
public schools, kindergarten through grade twelve. Sacramento: California State 
Department of Education. 

California State Department of Education. (1986). Recommended readings in literature, 
K-8. Sacramento: California State Department of Education. 

California State Department of Education. (1988). English-language arts model curriculum
guide: Kindergarten through grade eight. Sacramento: California State Department
of Education.

California State Department of Education. (1990). Science framework for California public 
schools, kindergarten through grade twelve. Sacramento: California State Department 
of Education. 

Colley, R. and Lujan, P. (1982). A structural analysis of speeches by Native American
students. In F. Barkin, E. Brandt, & J. Ornstein-Galacia, Bilingualism and language
contact. New York: Teachers College Press.

Commission on Teacher Credentialing. (1985). Alternative models of candidate-centered
assessment of teacher education programs. Sacramento: Commission on Teacher
Credentialing.

Commission on Teacher Quality. (1983). Report on teacher quality. Report to the 
California State Senate. Sacramento: Commission on Teacher Quality. 






Delpit, L. (1986). Skills and other dilemmas of a progressive black educator. Harvard
Educational Review, 56 (4), 379-385. 

Delpit, L. (1988). The silenced dialogue: Power and pedagogy in educating other peoples' 
children. Harvard Educational Review, 58 (3), 280-298. 

Estes, Gary D., Stansbury, Kendyll, and Long, Claudia. Assessment component of the
California new teacher project: First year report. San Francisco: Far West Laboratory
for Educational Research and Development, March 1990.

Foster, M. It's cookin now: A performance analysis of the speech events of a black teacher 
in an urban community college. Language in Society, 18 (1), 1-29. 

Goodlad, John I. (1984). A place called school: Prospects for the future. New York:
McGraw-Hill. 

Grant, Carl and Zeichner, Kenneth. (1981). Inservice support for first year teachers:
The state of the scene. Journal of Research and Development in Education, 14:99-111.

Graves, D. (1983). Writing: Teachers and children at work. Portsmouth, NH: Heinemann. 

Heath, S. (1983). Ways with words. Cambridge: Cambridge University Press. 

Hollins, E. (1982). The Marva Collins story revisited. Journal of Teacher Education, 33 (1),
37-40.

Holmes Group, Inc. (1986). Tomorrow's teachers: A report of The Holmes Group. East
Lansing, MI: The Holmes Group.

Kleinfeld, J. (1974). Effective teachers of Indian and Eskimo high school students. In
J. Orvik & R. Bernhardt (Eds.), Cultural influences in Alaskan native education. Center
for Northern Educational Research, University of Alaska, Fairbanks.

Ladson-Billings, G. (1990). Culturally relevant teaching. The College Board Review, 155,
20-25.

Leinhardt, Gaea. (1989). Math lessons: A contrast of novice and expert competence. Journal 
for Research in Mathematics Education, 20, 52-75. 

Michaels, S. and Cook-Gumperz. (1979). A study of sharing time with first-grade students:
Discourse narratives in the classroom. Proceedings of the fifth annual meeting of the
Berkeley Linguistics Society. Berkeley, CA: University of California.

Moir, E. (1990). Article in New Teacher News (Newsletter of CNTP), 3 (1), October 1990. 

Odell, Sandra. (1986). Induction support of new teachers: A functional approach. Journal of 
Teacher Education, 26-29. 






Ryan, Kevin. (1980). Biting the apple: Professional induction of beginning teachers. 

Quellmalz, E. (1985). Needed: Better methods for testing higher-order thinking skills.
Educational Researcher, 17 (5), 5-14.

Shulman, Lee. (1987). Knowledge and teaching: Foundations of the new reform. Harvard
Educational Review, 57, 1-22.

Shulman, L.S. and Sykes, G. (1986). A national board for teaching? In search of a bold
standard. A report for the Task Force on Teaching as a Profession. New York:
Carnegie Corporation.

Stiggins, R., Conklin, N., and Faires & Associates. (In press). Classroom assessment: A
task analysis. Albany, NY: SUNY Press. 

Southwest Regional Education Laboratory. (1990). 1988-1989 Evaluation Report. Los
Alamitos, CA: Southwest Regional Education Laboratory.

Southwest Regional Education Laboratory. (1991). 1989-1990 Evaluation Report. Los
Alamitos, CA: Southwest Regional Education Laboratory.

Taylor, O. and Lee, D. (1987). Standardized tests and African-American children.
Negro Educational Review, 38 (2-3), 67-80. 

Veenman, Simon. (1984). Perceived problems of beginning teachers. Review of Educational
Research, 54, 143-178. 

Watkins, Richard. (1985). A practitioner review of the content validity and passing
standards of the California Basic Educational Skills Test. Sacramento: Commission
on Teacher Credentialing.

Wheeler, P. (1986-87). The relationship between grade six test scores and the length of 
the school day. Educational Research Quarterly, 11 (3), 10-17. 

Wilson, Suzanne. (1988). Understanding historical understanding: Subject matter knowledge 
and the teaching of teachers. A dissertation submitted to Stanford University. 

Wise, Arthur, Darling-Hammond, Linda, Berry, Barnett, Klein, Stephen P. (1987). Licensing
teachers: Design for a teacher profession. Santa Monica, CA: The Rand Corporation.






APPENDIX A:

STATISTICAL COMPARISON OF TEACHER PERFORMANCE ON THE SECONDARY 
LIFE/ GENERAL SCIENCE TEACHER ASSESSMENT 
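
The tables in this appendix report, for each task, scores summed across raters and then summarized (MEAN, STD, N, MIN, MAX) within candidate subgroups. The original analysis was run in SAS; the Python below is only a minimal sketch of the same kind of summary. The function names, the record layout, and the ratings are hypothetical and are not taken from the pilot test data. The sketch assumes the sample (n - 1) standard deviation, which is left undefined when a subgroup contains a single candidate.

```python
from collections import defaultdict
from statistics import mean, stdev


def sum_across_raters(ratings):
    """Collapse (candidate_id, rater_id, score) records to one summed score per candidate."""
    totals = defaultdict(float)
    for candidate_id, _rater_id, score in ratings:
        totals[candidate_id] += score
    return dict(totals)


def describe_by_group(totals, group_of):
    """Return MEAN, STD, N, MIN and MAX of the summed scores for each group."""
    scores_by_group = defaultdict(list)
    for candidate_id, total in totals.items():
        scores_by_group[group_of[candidate_id]].append(total)
    return {
        group: {
            "mean": mean(scores),
            # Sample (n - 1) standard deviation; undefined for a single candidate.
            "std": stdev(scores) if len(scores) > 1 else None,
            "n": len(scores),
            "min": min(scores),
            "max": max(scores),
        }
        for group, scores in scores_by_group.items()
    }


# Hypothetical ratings: three candidates, two raters each (not pilot test data).
ratings = [
    ("c1", "r1", 4), ("c1", "r2", 5),
    ("c2", "r1", 3), ("c2", "r2", 3),
    ("c3", "r1", 5), ("c3", "r2", 5),
]
location = {"c1": "Inner City", "c2": "Others", "c3": "Others"}

summary = describe_by_group(sum_across_raters(ratings), location)
for group, stats in summary.items():
    print(group, stats)
```

A subgroup with one candidate has no standard deviation, which is why some rows in the tables show a mean but no STD.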






17:14 Thursday, October 18, 1990

CTC PILOT TEST ANALYSES: RAND DATA
Program: F:\DATA\SAS\PROGRAMS\RAND_A.SAS
Input Data: F:\DATA\SAS\DATASETS\RAND_DAT.SSD
Output File: F:\DATA\SAS\OUTPUT\RAND_A.OUT

Analysis A: Descriptive Statistics
Part 4: Task Level Scores summed across raters by FORM

Form=Form A

                                 | T1SS: Task 1, total, Scorer Sum | T2SS: Task 2, total, Scorer Sum | T3SS: Task 3, total, Scorer Sum
                                 |   MEAN    STD     N   MIN   MAX |   MEAN    STD     N   MIN   MAX |   MEAN    STD     N   MIN   MAX
Candidate's gender
  - Male                         |   7.36   1.75    11     4    10 |  72.91  12.91    11    44    90 |  13.09   6.95    11     2    22
  - Female                       |    [?]   2.19    21     0    10 |  77.10  12.81    21    50   [?] |    [?]    [?]   [?]   [?]   [?]
Candidate's race
  - Non-white                    |   7.25   3.06     8     0    10 |  81.75   5.28     8    72    88 |  17.25   6.10     8     9    24
  - White                        |   7.08   1.64    24   [?]    10 |  73.62  13.96   [?]   [?]   100 |  13.17   5.92    24     2    23
Teacher's prep
  - [label illegible]            |   0.00      .     1     0     0 |  72.00      .     1    72    72 |   9.00      .     1     9     9
  - [label illegible]            |   7.46   1.72    24     4    10 |  74.42  13.79    24   [?]   [?] |    [?]    [?]   [?]   [?]   [?]
  - [label illegible]            |    [?]    [?]   [?]   [?]   [?] |    [?]    [?]   [?]   [?]   [?] |    [?]    [?]   [?]   [?]   [?]
Candidate's grade
  - Middle                       |   7.70   1.25   [?]     6    10 |  79.00  16.10    10    44   100 |  15.30   6.8?    10     6    24
  - Junior high school           |   6.00   2.93     8     0    10 |  74.87   9.22     8    61    90 |  15.00   5.88     8     8    23
  - Continuous, Regular, JHS/HS  |   7.36   1.74    14     4    10 |  73.71  12.33    14    50    90 |  12.93   5.9?    14     2    23
Candidate's grade, collapsed
  - Middle, Junior high school   |   6.94   2.26    16     0    10 |  77.17  13.29    18    44   100 |  15.17   6.26    18     6    24
  - Continuous, Regular, JHS/HS  |   7.36   1.74    14     4    10 |  73.71  12.33    14    50    90 |  12.93   5.90    14     2    23
Candidate's location
  - Inner City                   |   7.00   1.50     9     4     9 |  82.33  10.02     9    64   100 |  16.56   5.96     9     6    24
  - Others                       |   7.17   2.23    23     0    10 |  73.04  12.99    23    44    97 |  11.26   6.1?    23     2    23
Candidate's age
  - 29 or below                  |   7.72   1.45    18     5    10 |  76.22   9.85    18    60    97 |  13.83   6.64    18     2    23
  - 30-34                        |   6.82   1.78    11     4    10 |  76.64  14.79    11    50   100 |  15.00   5.2?    11     6    24
  - 35 or above                  |   4.67   4.16   [?]   [?]   [?] |  66.67  23.18   [?]    44    90 |  13.33   8.39   [?]   [?]    23
ALL                              |   7.12   2.03    32     0    10 |  75.66  12.79    32    44   100 |  14.19   6.15    32     2    24

Task titles: Applying Effective [remainder illegible]; other titles illegible.
[?] or ? = illegible in the source copy; . = not computed (N = 1).



Form=Form B

                                 | T1SS: Task 1, total, Scorer Sum | T2SS: Task 2, total, Scorer Sum | T3SS: Task 3, total, Scorer Sum
                                 |   MEAN    STD     N   MIN   MAX |   MEAN    STD     N   MIN   MAX |   MEAN    STD     N   MIN   MAX
Candidate's gender
  - Male                         |   8.86   3.21    14     4    14 |   3.29   1.44    14     0     5 |  12.46   3.20    13     7    16
  - Female                       |   8.11   2.45    19     4    12 |   4.05   1.54    19     2     7 |  13.05   3.32    19     6    18
Candidate's race
  - Missing                      |   8.00      .     1     8     8 |   3.00      .     1     3     3 |  10.00      .     1    10    10
  - Non-White                    |   8.00   3.42     8     4    14 |   3.62   1.60     8     2     6 |  11.38   3.34     8     6    16
  - White                        |   8.58   2.65    24     4    14 |   3.79   1.56    24     0     7 |  1?.43   3.12    23     7    18
Teacher's prep
  - Regular                      |   8.54   2.72    24     4    14 |   3.71   1.65    24     0     7 |  13.39   3.16    2?     7    18
  - Intern                       |   7.29   2.75     7     4    11 |   3.71   1.38     7     2     5 |  10.43   2.82     7     6    14
  - Other                        |  11.00   2.81     2     9    13 |   4.00   0.00     2     4     4 |  14.50   2.12     2    13    16
Candidate's grade
  - Middle                       |   7.86   1.86     7     6    10 |   4.00   1.91     7     2     7 |  12.14   3.39     7     7    16
  - Junior high school           |   8.25   3.33    12     4    14 |   3.67   1.30    12     2     6 |  11.09   3.21    11     6    16
  - Continuous, Regular, JHS/HS  |   8.86   2.74    14     4    14 |   3.64   1.60    14     0     6 |  14.50   2.44    14   [?]   [?]
Candidate's grade, collapsed
  - Middle, Junior high school   |   8.11   2.83    19     4    14 |   3.79   1.51    19     2     7 |  11.50   3.22    18     6    16
  - Continuous, Regular, JHS/HS  |   8.86   2.74    14     4    14 |   3.64   1.60    14     0     6 |  14.50   2.44    14     8    18
Candidate's location
  - Inner city                   |   8.33   ?.24     9     4    14 |   3.22   1.09     9     2     5 |  12.13   3.48     8     7    16
  - Others                       |   8.46   2.65    24     4    14 |   3.92   1.64    24     0     7 |  13.04   3.20    24     6    18
Candidate's age
  - 29 or below                  |   9.06   2.43    16     4    14 |   3.62   1.63    16     0   [?] |  12.69   2.91    16     6    16
  - 30-34                        |   6.17   2.04     6     4     9 |   3.83   1.60     6     2     6 |  15.33   2.07     6    12    18
  - 35 or above                  |   8.73   3.13    11     4    14 |   3.82   1.47    11     2     7 |  11.50   3.66    10     7    16
ALL                              |    [?]   2.77    33     4    14 |   3.73   1.53    33     0     7 |  12.83   3.24    32     6    18

Task titles: Applying Effective [remainder illegible]; Classroom and [remainder illegible]; other titles illegible.
[?] or ? = illegible in the source copy; . = not computed (N = 1).



APPENDIX B:

SCIENCE LABORATORY ASSESSMENTS CONTENT AND FORMS 






California New Teacher Project 
Science Laboratory Assessment

Domains, Elements, & Indicators 



RMC Research Corporation 
Mountain View, California



Pedagogy 



A1. Planning - The objective(s) for the activity involves the development or utilization of
one or more of the scientific thinking processes (i.e., observing, communicating,
comparing, ordering, categorizing, relating, inferring, and applying). The objective(s) 
and the activity are not focused solely on facts, but also on concepts and processes. 
The teacher knows what prerequisite skills and knowledge are required for the planned 
activity and the extent to which the students have these. The activity is an appropriate 
one for helping students to achieve the objective(s) and one that can be safely 
implemented with the students, given the facilities, equipment, and materials that are 
available. 



A2. Sequence - The teacher organizes the steps or tasks of the laboratory activity in a 
logical or purposeful manner that allows students to achieve the lesson objective(s) and 
to complete the activity in an effective manner. Students do not exhibit confusion, 
or incorrect procedures or conclusions that might be due to inappropriate sequencing 
of the steps and tasks in the activity. 



A3. Prelab - The teacher provides the students with a focus for the activity and a
framework for learning. The opening is related to the teacher's objective(s). The
teacher may: explain the purpose of the activity, help the students anticipate the
activity, link it to the students' interests, point out the relevance of the activity to the
students' own lives, tie the activity to prior learning in the class or to other subject
areas, provide motivation for the students to learn from the lab activity, or review
background materials.



A4. Directions - The teacher provides clear and comprehensive directions, orally and/or in
writing as needed, to the students for doing the activity. The directions are at an
appropriate level of complexity and difficulty for the students. The teacher 
communicates expectations for students' work on the activity. 



A5. Explanation/Presentation - The teacher provides clear and accurate explanations,
presentations, and reviews of concepts, principles, definitions, and processes, as needed,
or allows students to do these. The teacher may do a brief demonstration or modeling
of the activity, as appropriate for the students and the lesson objectives. The
explanations/presentations are at a suitable level of complexity and difficulty for the
students, are visible to all students, and are related to the objective(s) of the activity.



A6. Monitoring/Adjusting - The teacher monitors student understanding and work during
the activity; usually this involves walking around the room during the activity.
Monitoring may consist of asking students specific questions about their understanding
of the content, methods or equipment; observing students doing the activity; and
listening to students discuss the activity with each other and with the teacher. During
the observation period, the teacher adjusts the lesson or activity for individual students,
small groups, or class-as-a-whole, as needed. Adjusting can take such forms as
providing guidance, reviewing lesson content, presenting the information in a different 
manner, clarifying information, modeling a step, or changing the sequencing of the 
steps or tasks. 



A7. Feedback - The teacher provides immediate, appropriate, and uncritical feedback to 
all students, individually or as a group, to promote attainment of the instructional 
objectives. Feedback (including feedback on wrong answers and errors) provides 
positive rewards, useful information, further motivation, or encouragement to students.
The teacher provides objective feedback to students regardless of ability, ethnicity,
gender, or other characteristics. The teacher's feedback indicates that the teacher can
distinguish among a student's response that represents a nonstandard but productive
insight, a response that indicates confusion, and one that indicates apathy.



A8. Questioning - The teacher asks questions that promote higher-order thinking processes
and achievement of the objectives. Questions can be directed to the entire class, small
groups, or individual students. The teacher involves as many students as possible,
regardless of gender, ethnicity, language proficiency, or intellectual ability. The teacher
asks questions at an appropriate cognitive level to encourage the development of skills
in one or more of these processes:

observing          categorizing
communicating      relating
comparing          inferring
ordering           applying

The teacher provides appropriate wait time for students to respond. The teacher
encourages the students to develop their own questions and answers, and builds on
student responses and comments.



A9. Closure - When appropriate and in keeping with the objectives of the activity, the 
teacher provides a "wrap-up" or summary, and links the lesson objectives to past or 
future learning, or allows students to provide such closure. The teacher may review 
and summarize the purpose of the activity and what conclusions can be drawn from 
it, or allow students to provide such closure. Closure is often done at the end of a
lab activity, but could also be done at the end of major segments completed earlier 
in the lab period or in a succeeding class period. The closure is directly related to the 
objective(s) of the lab activity. 



Content 

B1. Accurate - The teacher presents information that is accurate, and uses scientific
content, methods and procedures that are generally accepted in the scientific 
community. 



B2. Integrated - The teacher knows how the topic of the activity is connected to and
interrelated with a major theme of science (e.g., energy, patterns of change, stability),
other scientific topics, and topics in other subject areas. The teacher provides the
means by which students can interrelate and connect the topic of the activity to: (1) 
past and future learning on this topic, (2) other scientific topics, and/or (3) topics and 
academic skills in other subject areas. 



B3. Related to Objectives - The teacher presents scientific information and uses methods
that are related to the objective(s) of the laboratory activity. 






C. Materials/Equipment



C1. Teacher Use - The teacher properly uses the equipment and handles the materials
employed in the observed laboratory activity. Live organisms are maintained and 
handled in a humane and appropriate manner. Where applicable, the teacher is alert 
to student allergies, fears, and other problems related to the use of specimens or live 
organisms in the science lab activity. 



C2. Safe Setup - The setup of equipment, furniture, and materials has no serious
irregularities or dangerous conditions. The setting has, as needed, adequate ventilation,
first aid supplies, safety equipment, corrosive-resistant counter tops, a fire extinguisher,
running water, good lighting, etc. Materials and equipment are stored, labeled, and
moved properly. 



C3. Safe Practices - The teacher knows about the potential dangers involved in the planned 
science laboratory activity. The teacher informs students about, checks for 
understanding of, and enforces the proper use of equipment and handling of materials,
as needed. The teacher tells students about safety procedures, potential dangers and 
actions to take, and proper cleanup and disposal procedures. Students are wearing 
safety gear (e.g., goggles, aprons, gloves) when needed. Cleanup and disposal are 
completed in a well-coordinated and safe manner. The teacher is alert to potential 
safety problems, knows what to do if a safety problem occurs, and takes corrective 
measures when necessary. There are no observed teacher violations of state and 
federal safety laws and regulations on the setup, use, and handling of materials and
equipment. 



C4. Availability - The teacher has provided a sufficient supply of materials and the
necessary equipment so that all students can complete the activity and attain the
lesson objective(s). The teacher prepares and modifies the equipment to be
appropriate for the students (e.g., pre-mixes solutions, ties strings on weights). The
teacher has all necessary materials and equipment for the lab activity available and
ready to be used prior to the start of the class period. The procedures for distributing
and the placement of equipment are suitable for the facilities or layout of the class
setting. The teacher has provided students with easy and orderly access to the
materials and equipment. Provisions have been made for physically disabled students,
when present, so that they are able to participate in a meaningful way.






Management 



D1. Grouping - The grouping of students is done in a manner that facilitates the
completion of the activity, and the learning of the instructional objectives. The teacher 
has considered such factors as suitable size for the activity, the number and locations 
of work stations, the amount of equipment and working space available, the time 
needed to do each step, the roles assigned to various members of each group, the 
variable work rate of different students, and the grouping of students so they work well 
together. 



D2. Other Personnel - If another person(s) (e.g., aide, peer tutor) is present and if that 
person is involved with the lab activity in an instructional or a managerial role, the 
teacher supervises that person's work as needed.



D3. Routines and Transitions - Classroom routines (e.g., taking attendance, distribution of
materials, pencil sharpening) and transitions (e.g., from whole-class activity to
small-group activity) occur smoothly and efficiently. There is continual progress toward
students completing the activity and attaining the objective(s). The amount of
non-instructional time is minimal.



D4. Student Engagement - The teacher structures the laboratory activity so that most of 
the students are engaged in a laboratory task most of the time. The teacher quickly 
attempts to reengage any student who is not on task or who deviates from the 
prescribed activity. 



D5. Timing - The teacher allocates sufficient time for each step so that the students have 
an opportunity to complete the activity and attain the lesson objective(s). The teacher 
makes adjustments during the lab activity for students who complete it quickly, as well 
as for those who do not keep up with the other students. 



D6. Student Behavior - The teacher encourages and reinforces appropriate student
behavior. The teacher responds to student misbehavior quickly and positively. The 
teacher asserts control and maintains order so as to facilitate a productive lab activity. 



D7. Lab Cleanup - Teacher devises, explains and implements lab cleanup procedures so 
that the lab setting is left neat and clean at the end of the activity. 






Knowledge of Students 



E1. Diversity - The teacher tailors instructional activities for a diverse classroom of students
with different ethnic, cultural, language, and socioeconomic backgrounds and, when 
present, disabled students. The teacher does not compromise the rigor of the lesson 
and each student is challenged at an appropriate level. 



E2. Student Characteristics - The teacher offers instruction and provides an activity that
is appropriate for students' interests, cognitive and developmental levels, and prior
knowledge. The teacher adjusts the information and activity for individual student 
differences when appropriate. The teacher is aware of student preconceptions and 
misconceptions that might interfere with the attainment of the objectives, and addresses 
them during the lab activity, as needed. The students seem to understand what is 
being taught and to be challenged by the activity and instruction. 

Climate 

F1. Interactions with Students - The teacher interacts with all students respectfully,
positively, equitably, and in a culturally appropriate manner. The teacher is sensitive 
to students' preconceptions and values. The teacher avoids sarcasm and criticism. 
The teacher communicates high expectations for student learning and behavior, and 
provides all students with an opportunity to participate and learn. 



F2. Interactions among Students - The teacher encourages and allows for productive and
activity-related interactions and sharing among students. The students treat each other 
respectfully and politely. 



F3. Attitudes - The teacher exhibits a positive attitude toward and enthusiasm for science. 
The teacher attempts to instill in students positive attitudes about learning and about 
science. The teacher demonstrates an attitude that the lab activity is a vital aspect of
the students' learning, and that the individual student's results and observations are
important.



F4. Inquiry - The teacher fosters an environment in which the processes of science are
important, and an environment that promotes questioning, problem solving, discussion
of error, and evaluation of competing ideas. The teacher does not place undue
emphasis on students' obtaining "correct" or expected results in a laboratory activity.
The teacher provides opportunities for discussion of anomalous results without
embarrassing students. The teacher and students can criticize ideas without criticizing
each other. The teacher encourages students to draw their own conclusions from
observed data and to state them in their own words. 






Communication 



G1. Speaking - The teacher's oral communications (e.g., presentation, directions, feedback,
informal conversations) are coherent and clear to all students. They have clarity of 
meaning and are given in a fluent manner and with a pleasant tone. Messages are 
not vague, ambiguous or incomplete. The teacher can be clearly heard by all students 
to whom the teacher is speaking (appropriate volume, enunciation, rate). Grammatical 
errors and mispronunciations, if any, are minimal and do not interfere with 
communication. The teacher does not use slang or vulgarities. The teacher uses 
acceptable conventions of spoken language for communicating with the students. 



G2. Writing - The teacher's written communications (e.g., handouts, materials on
chalkboard, overhead transparencies, posters made by the teacher, displays) are clear 
to students. They are not vague, ambiguous, or incomplete (unless required by the 
lesson objectives). Written materials can be clearly read by students (appropriate level 
of difficulty, legible, visible). Errors in grammar and spelling are minimal or 
nonexistent, and do not interfere with communication or set a bad example for 
students. 



G3. Listening - The teacher listens to all students and reacts in an appropriate and
supportive manner to their questions, answers, comments, failures to answer, errors, 
and needs. 



G4. Strength of Presence - The teacher shows confidence with the science content, the
methods and procedures, and the use of equipment and materials. When
questions/situations arise that are outside of the teacher's background/experience, the
teacher can frankly admit that and proceed to engage students in a cooperative effort
to learn together. The teacher uses suitable body language and eye contact to gain
and maintain control of the class and to hold the students' attention. 






California New Teacher Project 
Science Laboratory Assessment



Pre-Observation Questionnaire 
(Part I of IV) 

RMC Research Corporation 
Mountain View, California 



Instructions: This form is to be completed by the teacher within 48 hours before being observed.

Teacher Date of Observation - - 

Principal 



School Name 



School Address 



School Telephone (_ 



Date Questionnaire Completed - - 

Credential(s) Sought: 

Multiple Subject (K-8) 

Life Science 

Physical Science 



SCHEDULE

                                  Location            Time

Pre-Observation Conference        ____________        ______ to ______

Observation                       ____________        ______ to ______

Post-Observation Conference       ____________        ______ to ______



Put this completed questionnaire and any relevant written materials (e.g., lesson plan, direction 
sheets for students, copy of students' data recording form) in an envelope, seal it, and mark the
envelope as follows. Write your name in the space for teacher and the date of the scheduled 
observation on the line for date. 



CNTP Science Lab Assessment 



Teacher



Date 



Leave the sealed envelope in the school office. The observer for the CNTP pilot testing will pick 
it up early on the morning of the scheduled observation.






Section A. The Class Being Observed 

These questions refer to the students who will be participating in the observed lab activity. 

1. Name of Course

2. Number of Students Enrolled 3. Grade Level(s) 

3a. If you have more than one grade level in this class, how many students are at each grade
level? (e.g., 9th-14, 10th-15)



4. Are there any special needs students in this class (e.g., LEP, compensatory education, gifted,
disabled) or any students who have behavior problems or are frequently disruptive? If yes,
please list the number(s) of such students (e.g., 5 LEP, 3 gifted, 1 hearing impaired) and
provide information that you think the observer should be aware of to understand
what may be happening during the laboratory activity.

Numbers: 



Other Information: 



5. What is the general academic ability level of students in this class? (e.g., most at grade 
level, about 1/4 one grade lower; or, all college prep or honors) 



6. Will any students be leaving or entering the room during the observation period?
Yes No 

6a. If yes, how many are there, and do they do this on a regular basis? 
6b. How do they make up the time missed during the lab activity? 



7. Will any other persons (e.g., bilingual aide, peer tutor) be present during the observation? 
Yes No If yes, what will their roles be? 






8. What administrative activities, not related to the laboratory activity, will occur during the
observation period? (e.g., taking attendance) 

9. Indicate the location of your lab activity. (Check one) 
Your regular classroom 

Your regular laboratory or classroom/lab combination 

Another classroom at your school 

Another laboratory or classroom/lab combination at your school 

Another location on the school site (e.g., yard) 

A location off the school site (e.g., beach, museum). 

10. If the location is in your school building, do you share this location with other teachers?
Yes No

11. Is there anything else the observer should know about your classroom and/or the students? 



Section B. The Laboratory Activity 

12. Please complete the chart on the next page. List the major instructional objective(s), that
is, what you want the students to be able to do as a result of this laboratory activity. For 
example: (1) Students will be able to weigh liquids. (2) Students will be able to calculate 
the density of liquids; or (1) Students will be able to focus a microscope, (2) Students will 
know how to draw a plant cell and an animal cell, showing the key structural parts, (3) 
Students will identify the structural parts associated with either plant cells or animal cells 
as well as those associated with both types of cells by comparing their two drawings. 

In the appropriate space (or box) provided for each objective briefly describe each of the 
following: 

o the steps or tasks of the laboratory activity, 
o the student grouping planned, 

o the materials and equipment planned for that objective, and 

o the safety issues and precautions related to equipment and procedures. 






California New Teacher Project 
Science Laboratory Assessment 





Chart columns (one row per objective):

    Objective(s) (Student Outcomes)
    Laboratory Activities (Steps, Tasks)
    Student Groups (Size, Number of Stations, etc.)
    Materials and Equipment
    Safety Issues (Equipment and Procedures)

1.



2.



Continue on the next page.

3.



4.



Continue on back of this sheet if you need more space.



13. Source : What was your primary source(s) for this lab activity? Check those which apply. 

Textbook Professional journal or magazine 

Another teacher An outside science educational agency (e.g.,

museum, nature reserve)

Developed solely by you Adapted by you from any of the above sources

Other (please list)

14. Theme : Which scientific theme(s) best pertains to your laboratory activity? See the 
descriptions on page 7. Check those which apply. 

Energy Measurement Stability

Environment Patterns of Change Systems and Interactions

Evolution Scale and Structure

Other (please list)

15. Content Area : Which content area(s) best reflects this laboratory activity? See the
attached list on page 8.

If other, specify area 

16. Type: Which type of laboratory best categorizes this activity? (check one) 

Discovery/Inquiry Exploratory Process Development 

Illustrative/Clarifying Introductory Skills Development 

Other (please list) _____ 

17. If there is anything else the observer should know about the planned science laboratory
activity, please write it below or on the back of this page. 

Sign and date this form. Please see the instructions in the box on the first page. 

Teacher's Signature Date Signed __- - 



List of Science Themes* 

(for use with question 14) 



1. Energy (capacity to do work or ability to make things move; the basis for reactions between
chemical compounds; the ability of living systems to maintain their system, to grow, and to
reproduce)

2. Environment (the surrounding circumstances and conditions; the impact of external or
extrinsic conditions; appreciation of one's own environment; conservation; pollution)

3. Evolution (changes of natural entities and systems through time; the study of the patterns
and processes that affect these changes)

4. Measurement (systems of measurement units; assessing dimensions, quantities, or capacities)

5. Patterns of Change (trends; cyclical patterns; irregular changes)

6. Scale and Structure (relationships of structures; hierarchical levels of structures and
properties of each level; interplay of structure and function)

7. Stability (constancy; a balanced steady state; static and dynamic equilibrium) 

8. Systems and Interactions (solar system; ecosystem; individual organisms; chemical and 
physical systems; input and output; feedback) 

9. Other (specify theme) 



* Includes the six themes from the California Framework Draft (September 1989 Edition) 






List of Science Content Areas 
(for use with question 15) 



I. Life Sciences

A. Cellular and Molecular Biology

B. Plants, Botany

C. Protists, Monerans, Fungi

D. Animals, Zoology 

E. Human Beings 

F. Ecosystems, Populations, Communities, Biogeochemical Cycles 

G. Genetics 

H. Evolution 

I. Other (specify area) 

II. Earth Sciences 

A. Basic Land and Water Forms 

B. Structure of Rocks and Minerals 

C. Structure of Solar System, Planetary Systems

D. Structure of Galaxies, the Universe 

E. Movement of Materials (e.g., weathering, plates, tides) 

F. Changes in Materials, Cycles (e.g., weather, rocks) 

G. Evolution 

H. Other (specify area)

III. Physical Sciences-Chemistry
A. States of Matter

B. Models of Atomic, Molecular, Ionic Structures

C. Polarity and Implications for Properties of Molecules

D. Simple Nuclear Chemistry (e.g., radioactivity, fission, fusion)

E. Simple Models for Chemical Bonds (including implications for properties and 
geometries of molecules) 

F. Factors that Govern Chemical Transformations (e.g., energy and spatial changes and
effects, chemical equilibrium, electrolytic and voltaic cells, radioactive decay) 

G. Chemical Reactions 

H. Other (specify area) 

IV. Physical Sciences-Physics 

A. Mechanics (e.g., motion, dynamics, gravity)

B. Conservation of Mass, Momentum and Energy 

C. Heat 

D. Electricity and Magnetism 

E. Wave Motion (e.g., sound, light)

F. Atomic and Nuclear 

G. Other (specify area) 

V. Safety and Manipulative Skills 

A. Laboratory Safety 

B. Manipulative Laboratory Skills 






California New Teacher Project
Science Laboratory Assessment 



Pre-Observation Conference Note-Taking Form 

(Part II of IV) 



Observer 



Teacher 



Date 



Start time 



End time





Element 




Question# 


Code(s) 


Responses/Notes 






California New Teacher Project 
Science Laboratory Assessment

DOCUMENTATION SORTING RECORD

RMC Research Corporation
Mountain View, California



OBSERVER TEACHER DATE 



DOMAIN/ 

ELEMENT EVIDENCE/NOTES/RESPONSES 

A. PEDAGOGY: 
1. Planning 



2. Sequence 



3. Prelab 



4. Directions 



5. Explanation/ 
Presentation 






OBSERVER 



TEACHER



DOMAIN/ 

ELEMENT EVIDENCE/NOTES/RESPONSES 

6. Monitoring/ 
Adjusting 



7. Feedback 



8. Questioning



9. Closure 



B. CONTENT: 
1. Accurate 



2. Integrated 






DOMAIN/ 

ELEMENT EVIDENCE/NOTES/RESPONSES



3. Related to 
Objectives 



C. MATERIALS/EQUIPMENT: 
1. Teacher Use



2. Safe Setup 



3. Safe Practices 



4. Availability






OBSERVER TEACHER DATE

DOMAIN/

ELEMENT EVIDENCE/NOTES/RESPONSES



D. MANAGEMENT: 
1. Grouping 



2. Other 
Personnel 



3. Routines and 
Transitions 



4. Student 
Engagement 



5. Timing 






OBSERVER 



TEACHER 



DATE 



DOMAIN/ 

ELEMENT EVIDENCE/NOTES/RESPONSES 



6. Student 
Behavior 



7. Lab Cleanup 



E. KNOWLEDGE OF STUDENTS: 
1. Diversity 



2. Student
Characteristics 







OBSERVER 



TEACHER 



DATE



DOMAIN/ 

ELEMENT EVIDENCE/NOTES/RESPONSES 

F. CLIMATE: 
1. Interactions
with Students



2. Interactions
among Students



3. Attitudes 



4. Inquiry 






OBSERVER TEACHER DATE

DOMAIN/

ELEMENT EVIDENCE/NOTES/RESPONSES

G. COMMUNICATION:
1. Speaking



2. Writing



3. Listening 



4. Strength of 
Presence 



Observer: 



(sign)

Date: 






California New Teacher Project
Science Laboratory Assessment

SUMMARY REPORT FORM

TEACHER DATE OF OBSERVATION 

JUDGMENTS DOMAIN/ELEMENTS REMARKS 

A. PEDAGOGY (Planning, Sequence, Prelab, Directions, Explanation/Presentation,
Monitoring/Adjusting, Feedback, Questioning, Closure) 



B. CONTENT (Accurate, Integrated, Related to Objectives) 




C. MATERIALS/EQUIPMENT (Teacher Use, Safe Setup, Safe Practices, Availability) 




D. MANAGEMENT (Grouping, Other Personnel, Routines & Transitions, Student
Engagement, Timing, Student Behavior, Lab Cleanup) 







E. KNOWLEDGE OF STUDENTS (Diversity, Student Characteristics) 



F. CLIMATE (Interactions with Students, Interactions among Students, Attitudes,
Inquiry) 



G. COMMUNICATION (Speaking, Writing, Listening, Strength of Presence) 



OVERALL JUDGMENT 
COMMENTS: 



OBSERVER DATE / / 

printed

signed 






California New Teacher Project 
Science Laboratory Assessment



Questions for the Pre-Observation Conference

(Part II of IV)

RMC Research Corporation 
Mountain View, California 

Below are the questions that will be asked during the Pre-Observation Conference. 

1. I have reviewed your Questionnaire. Is there anything on it you need to change before we 
continue? 

2. Why did you select this particular activity? 

3. Did you design or modify the activity in order to make it appropriate for the students'
background and interests, or to better enable you to accomplish your objectives? If yes,
explain how and why.

4. Explain the scientific concepts and/or skills you are teaching in this lab activity. 

5. What are some of the incorrect preconceptions that students may have that relate to this 
activity? (pause) How do you plan to address these during the lesson? 

6. What prior instruction have you implemented related to the lab activity? (pause) What 
do students already know about this topic? 

7. Have you provided previous instruction to ensure that students have the technical skills
(e.g., students know how to use a voltmeter) requisite to the successful completion of this
laboratory activity? If yes, was this provided recently? If not, what techniques have you
employed to provide you evidence that students are ready to use the required processes and
technical skills?

8. What instruction are you planning to do in the future related to the activity?

9. What is the relationship or contribution of this laboratory activity to the broad goals for the
students' learning? (pause) Does it provide linkage from one concept to the next, or is
it part of a continuing direction within one major concept? If yes, please explain.

10. What advanced thinking skills (e.g., comparing, estimating, inferring) will students be 
encouraged to use or required to apply in order to productively participate in this activity? 

11. What factors did you consider in grouping students for this activity? (pause) Is this a 
departure from your normal grouping for this class? 

12. What safety precautions will you take into consideration during this activity? (pause) What
would you do if ______? (The observer should ask
about a safety problem that might occur in this area of science; for example, a dangerous
chemical spill, a heat burn, a deep cut on a student's hand.)






13. Do you have a sufficient supply of materials and equipment for this activity? (pause) Are
there any equipment problems or limitations that I should know about? (pause) If yes,
how do you plan to cope with shortages or problems?

14. Are there any special procedures that must be followed in cleaning up after the activity? 

15. Is there anything else you would like to tell me about your students and today's laboratory 
activity that we haven't covered in this meeting or that wasn't on the Questionnaire and 
that would help me better understand and assess the activity I observe? 







California New Teacher Project 
Science Laboratory Assessment

GUIDED NOTE-TAKING RECORD FOR THE OBSERVATION 

Part III of IV 



OBSERVER TEACHER DATE PAGE

TIME



DOMAIN/ EVIDENCE/NOTES 
ELEMENTS (be specific) 

A. PEDAGOGY

1. Planning 

2. Sequence 

3. Prelab 

4. Directions 

5. Explanation/Presentation 

6. Monitoring/Adjusting 

7. Feedback 

8. Questioning 

9. Closure 



B. CONTENT 

1. Accurate 

2. Integrated 

3. Related to Objectives 



C. MATERIALS/
EQUIPMENT
1. Teacher Use
2. Safe Setup

3. Safe Practices 

4. Availability 



D. MANAGEMENT 
1. Grouping 
2. Other Personnel

3. Routines & Transitions 

4. Student Engagement 

5. Timing 

6. Student Behavior 

7. Lab Cleanup 



E. KNOWLEDGE OF
STUDENTS
1. Diversity

2. Student Characteristics



F. CLIMATE 

1. Interactions with Students 
2. Interactions among Students

3. Attitudes 

4. Inquiry 



G. COMMUNICATION 
1. Speaking 
2. Writing

3. Listening 

4. Strength of Presence 






California New Teacher Project 
Science Laboratory Assessment

Questions for the Post-Observation Conference 

(Part IV of IV) 

RMC Research Corporation 
Mountain View, California 



Below are the questions to be asked during the Post-Observation Conference. The observer should 
add questions that will improve the understanding of what was observed and the assessment made for 
each Domain. 

1. Did the lab activity go as you expected? (pause) If no, describe what happened that was 
unexpected, (pause) How are you going to deal with this problem? 

2. Were the objectives attained by your students? (pause) What type of feedback will you 
provide to students now that the lab period is over? 

3. How do you plan to assess the retention of these objectives? 

4. Based on how your students did today, do you feel you need to do additional follow-up 
instruction related to this activity? 

5. If you were to redo this activity, what changes would be desirable? (Possible prompts: Any 
in the content? Any procedural changes? Any questions you might have asked students in 
order to redirect them? Any other changes?) 

6. (If necessary, create your own question(s) to fill in missing information on the Domains and
Elements. List your question(s) on the Note-Taking Form.)

7. (If there are questions you have about areas not covered by the Domains and Elements, but
which are related to the assessment process, ask these, too, and record the questions on the
Note-Taking Form.)

8. Is there anything else you would like to tell me about today's laboratory activity? 






APPENDIX C:



STATISTICAL COMPARISON OF GROUP PERFORMANCE ON THE ASSESSMENT 
OF COMPETENCE IN MONITORING STUDENT ACHIEVEMENT 






TABLE

STATISTICAL COMPARISON OF GROUP PERFORMANCE ON THE ASSESSMENT
OF COMPETENCE IN MONITORING STUDENT ACHIEVEMENT

Pretest

                                   Form A              Second form*
Group of Teachers             Mean     SD      N     Mean     SD      N

Males                         39.8   12.7      5     42.6   11.7      9
Females                       41.4   14.5     21     45.7   10.5     15

K-3 Teachers                  46.1   14.4     12     46.3    8.8     12
4-6 Teachers                  35.2    8.4      5     42.9   14.7      9

Inner-City Teachers           37.4   15.6      5     38.0   13.9     --**
Non Inner-City Teachers       43.5   15.3     15     46.5   10.7     17**

White Teachers                43.6   13.8     16     46.9   10.8     17
Minority Teachers             23.0   11.3      2     36.3   11.7      4

Posttest

                                   Form A              Second form*
Group of Teachers             Mean     SD      N     Mean     SD      N

Males                         44.1   19.0      8     48.6    3.8      5
Females                       43.9   13.8     14     47.7   12.7     19

K-3 Teachers                  43.9   11.8     12     50.9   13.4     12
4-6 Teachers                  45.1   20.5      9     44.6    7.0      5

Inner-City Teachers           37.8   20.3      4     43.4   12.8      5
Non Inner-City Teachers       46.0   14.7     17     50.6   11.5     14

White Teachers                45.2   16.7     17     48.3   11.9     16
Minority Teachers             41.0   11.0      4     48.5   19.1      2

* The heading of the second column group is not legible in the source.
** Only one N (17) is legible for this pair of rows; its placement in the
   Non Inner-City row is inferred from the other group sizes.






APPENDIX D:

AN EXAMPLE OF A SCORING SHEET FOR THE LANGUAGE 
ARTS PEDAGOGICAL KNOWLEDGE ASSESSMENT 






Question 2. Goals 



What is the teacher's main goal for the students? Provide a rationale for why this 
is an important goal for a language arts activity. 

(One point for the goal; one point for each supportive statement in the rationale; two
points possible in the rationale section, for weaving two of the supportive statements into
an answer; three points maximum)

A. Main goal: 

_____ The primary goal of the teacher is to encourage her students to 

grapple with the challenge of making a difficult decision (decision 
making). The decision in this case (in the text and in the class 
activities) is one without a clear or single answer. 

B. Rationale: 

Note: Any item listed below could be used as a rationale for the goal. 

The rationale for this goal as a language arts activity could include any or several of the
following items:

_____ Goal is connected to the text (central character faces a similar
decision)

_____ Students can benefit from examining/discussing the steps
involved in decision making

_____ Students recognize there may be no one right or wrong choice;
instead each alternative has good points and bad points

_____ Provides opportunity for meaningful, involved discussion of book

_____ Provides opportunity for discussion of ethical or cultural issues

_____ Discussion or class activities using this goal requires higher order
thinking skills

_____ Class activities or discussions will incorporate listening, speaking,
reading and writing (integrated approach)

Total for Question 2 
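
The point rule stated for Question 2 can be written out as a short calculation. The sketch below is illustrative only and is not part of the scoring materials: the function name and its two inputs are hypothetical, and it assumes the rule means one point for identifying the main goal plus one point per supportive statement, with the rationale capped at two points.

```python
def score_question_2(goal_identified: bool, supportive_statements: int) -> int:
    """Illustrative total for Question 2 (hypothetical helper, not from the report).

    Assumed reading of the scoring key:
      - 1 point if the response states the main goal,
      - 1 point per supportive statement in the rationale,
      - rationale capped at 2 points, so the maximum total is 3.
    """
    if supportive_statements < 0:
        raise ValueError("supportive_statements must be non-negative")

    goal_points = 1 if goal_identified else 0
    # The key allows at most two rationale points, however many items are cited.
    rationale_points = min(supportive_statements, 2)
    return goal_points + rationale_points


if __name__ == "__main__":
    # A response naming the goal and weaving in three rationale items.
    print(score_question_2(True, 3))
```

Under this reading, a response that cites rationale items but never states the goal can earn at most two points.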



Scoring Key D# l Scenario 3 






APPENDIX E:



STATISTICAL COMPARISON OF TEACHER PERFORMANCE ON THE LANGUAGE 
ARTS PEDAGOGICAL KNOWLEDGE ASSESSMENT 






11:27 Monday, October 22, 1990 



CTC PILOT TEST ANALYSES: LAPKA DATA 
Program: F:\DATA\SAS\PROGRAMS\LAPKA_A.SAS 
Input Data: F:\DATA\SAS\DATASETS\LAPKADAT.SSD, LAPKADEM.SSD 
Output File: F:\DATA\SAS\OUTPUT\LAPKA_A.OUT 

Analysis A: Descriptive Statistics 
Part 4: Scenario and LAPKA Total Scores summed across raters (except for Scenario 3) 



Column groups in the original listing: Candidate's gender (Male, Female);
Candidate's race (Non-minority, Minority); Candidate's education (M=Data
Missing, Non-California, California); Candidate's preparation (<= 1 Course,
>= 2 Courses); Candidate's grade (Grade 3 or less, Grade 4 or more);
Candidate's location (Suburban/Rural, Urban, Inner City); ALL.
A "." is the missing value printed in the listing; a "?" marks a digit or
cell that is not legible in the source.

TOTS1A_S: Total for Scenario 1A

Group                     MEAN     STD      N     MIN     MAX
Male                         .       .      ?       .       .
Female                   17.62    3.54     21    5.00   22.00
Non-minority             17.41    3.79     17    5.00   22.00
Minority                 18.50    2.38      4   16.00   21.00
M=Data Missing               .       .      ?       .       .
Non-California           17.50    1.29      ?   16.00   19.00
California               17.65    3.92     17    5.00   22.00
<= 1 Course              17.33    4.97      9    5.00   22.00
>= 2 Courses             17.83    2.17     12   14.00   21.00
Grade 3 or less          17.82    ?.73     11    5.00   22.00
Grade 4 or more          17.40    1.71     10   14.00   20.00
Suburban/Rural           18.43    2.70      ?   14.00   22.00
Urban                    16.67    4.72      9    5.00   2?.00
Inner City               18.20    1.92      5   15.00   20.00
ALL                      17.62    3.54     21    5.00   22.00

PCTS1A_S: Pct. for Scenario 1A

Group                     MEAN     STD      N     MIN     MAX
Male                         .       .      0       .       .
Female                   67.77   13.62     21   19.23   84.62
Non-minority             66.97   14.59     17   19.23   84.62
Minority                 71.15    ?.16      ?   61.54   80.77
M=Data Missing               .       .      ?       .       .
Non-California           67.31    4.97      ?   61.54   73.08
California               67.87   15.08     17   19.23   84.62
<= 1 Course              66.67   19.13      9   19.23   84.62
>= 2 Courses             68.59    8.34     12   53.85   80.77
Grade 3 or less          68.53   18.19      ?   19.23   84.62
Grade 4 or more          66.92    6.59     10   53.85   76.92
Suburban/Rural           70.88   10.38      ?   53.85   84.62
Urban                    64.10   18.14      9   19.??       ?
Inner City               70.00    7.40      ?   57.69   76.92
ALL                      67.77   13.62     21   19.23   84.62

TOTS1B_S: Total for Scenario 1B (MEAN row only on this page)

Group                     MEAN
Male                      7.25
Female                    ?.85
Non-minority              ?.9?
Minority                  7.00
M=Data Missing            7.00
Non-California            6.50
California                7.08
<= 1 Course               6.89
>= 2 Courses              7.00
Grade 3 or less           6.89
Grade 4 or more           7.00
Suburban/Rural            7.00
Urban                     6.8?
Inner City                   ?
ALL                       ?.94

(CONTINUED)






11:27 Monday, October 22, 1990 22 



CTC PILOT TEST ANALYSES: LAPKA DATA 
Program: F:\DATA\SAS\PROGRAMS\LAPKA_A.SAS
Input Data: F:\DATA\SAS\DATASETS\LAPKADAT.SSD, LAPKADEM.SSD
Output File: F:\DATA\SAS\OUTPUT\LAPKA_A.OUT

Analysis A: Descriptive Statistics
Part 4: Scenario and LAPKA Total Scores summed across raters (except for Scenario 3)






Candidate's 
gender 



candidate's 
race 



Non- 
mino- 
rity 



TOTS1BS: Total 
for Scenario IB 



STD 
N 

MI N 



PCTS1B S: Pet. 
for Scenario IB 



TOTS 2 S : Total 
for Scenario i. 



MAX 
MEAN 

STD 
N 

Ml N 



MAX 
MEAN 

ST!) 



j Fema- 
Male I le 

0.96| 0.69| 0.79 

*| 13 | 12 



+ + 

6.00| 6.00| 6.00 

8.00| 8.00| 8.00 
10.62|85.58|86.46 
1 .97) B.6l| 9.91 



+ + 

4| 13 | 12 

75.01|75.00|75.00 

100. Oj 100. 0| 100.0 

53 . 25 | 4 v> . 03 | 47 . 70 

7 . 72 | 9 . 1 3 | 9.68 



Mino- 
rity 

0.71 

5 



Candidate's 
education 



Cand idate' s Cand i date ' s 
preparation | grade | 



candidate' r» 
location 



M^Dt-- 
ta 

Miss- 
ing 



Non- 
Cali- 
f orn- 

ia 



Cali- 
forn- 
ia 



6.00 
8.00 
87.50 
8.84 



5 

75.00 
100.0 
43.90 
7.11 



.| 0V5fi| 0.79 



i 



<« 1 
Cour- 
se 



2 

Cour- 
ses 



Grade 
3 or 
less 



0.60^,931 0.78 



Grade 
4 or 
more 

0.76 



1| 4| 



12 



9| 



P 



7.00| 6.00| 6.00| 6.00| 6.00 
7.00| 7.00| 8.00| 8.00| 8.00 
87. 50 | 8 1.25 | 88. 54 | 86. 11 | 87.50 
.| 7 . 22 | 9.91 j 7.51|11.57 



1| 4| 12 | 9| 8 
87. 50 | 75. 00 j 75. 00 | 75. 00 | 75.00 
87. 50 | 87. 50 | 100.0 | 100.0 | 100.0 
64. 00 | 4 7. 13 | 46. 10 | 49. 06 | 4 5. 04 

. | 7 . 32 | 9 . 30 | 9 . 89 | 8.42 



6.00 
8.00 
86. 11 
9.77 



9 

75.00 
100.0 
44.32 
8. 84 



6.00 
8.00 
87.50 
9.45 



8 

75.00 
100.0 
48.95 
9.11 



Subu- 
rban- 
/Rur- 
al 



Inner 
City 



Urban 

0.89 



1 1 



6.00 
8.00 
87. 50 
11.18 



1 1 

75.00 
100.0 

47.55 
9.r,7 



0.41 
6 



6.00 
7.00 
85.42 
5.10 



6 

75.00 
87.50 
44 .67 

9.52 



0 



49 . GO 

5. 26 



ALIi 
0-75 
17 



6.00 
8.00 
86.76 
9.34 



17 

7 5.00 
100.0 
46.75 
9.17 



(CONTINUED)



11:27 Monday, October 22, 1990 



CTC PILOT TEST ANALYSES: LAPKA DATA 
Program: F:\DATA\SAS\PROGRAMS\LAPKA_A.SAS
Input Data: F:\DATA\SAS\DATASETS\LAPKADAT.SSD, LAPKADEM.SSD
Output File: F:\DATA\SAS\OUTPUT\LAPKA_A.OUT

Analysis A: Descriptive Statistics 
Part 4: Scenario and LAPKA Total Scores summed across raters (except for Scenario 3) 













          Candidate's       Candidate's           Candidate's            Candidate's       Candidate's           Candidate's
            gender             race                education             preparation          grade                location
                                                        Non-
                             Non-              Data    Cali-    Cali-     <= 1     >= 2  Grade 3  Grade 4 Suburban             Inner
           Male   Female minority Minority  missing   fornia   fornia   Course  Courses  or less  or more   /Rural    Urban     City      ALL

TOTS2_S: Total for Scenario 2
N             4       36       30       10        1        8       31       17       23       19       21       20       15        5       40
MIN       46.00    28.00    28.00    34.00    64.00    34.00    28.00    28.00    28.00    28.00    28.00    28.00    28.00    41.00    28.00
MAX       64.00    65.00    65.00    60.00    64.00    57.00    65.00    65.00    63.00    65.00    64.00    65.00    63.00    55.00    65.00

PCTS2_S: Pct. for Scenario 2
MEAN      64.94    56.13    58.17    53.54    78.05    57.47    56.22    59.83    54.93    54.04    59.70    57.99    54.47    60.71    57.01
STD        9.41    11.13    11.81     8.67        .     8.92    11.34    12.06    10.27    10.79    11.11    11.79    11.61     6.42    11.19
N             4       36       30       10        1        8       31       17       23       19       21       20       15        5       40
MIN       56.10    34.15    34.15    41.46    78.05    41.46    34.15    34.15    34.15    34.15    34.15    34.15    34.15    50.00    34.15
MAX       78.05    79.27    79.27    73.17    78.05    69.51    79.27    79.27    76.83    79.27    78.05    79.27    76.83    67.07    79.27

TOTALS3: Total for Scenario 3, 1 Rater
MEAN      26.25    23.92    24.06    24.40    30.00    24.50    23.87    25.33    23.22    23.52    24.80    24.80    23.12    24.80    24.15
STD        7.41     4.80     4.59     6.52        .     3.66     5.34     4.73     5.18     5.88     4.02     5.61     4.79     3.42     5.04
N             4       37       31       10        1        8       32       18       23       21       20       20       16        5       41



(CONTINUED)









          Candidate's       Candidate's           Candidate's            Candidate's       Candidate's           Candidate's
            gender             race                education             preparation          grade                location
                                                        Non-
                             Non-              Data    Cali-    Cali-     <= 1     >= 2  Grade 3  Grade 4 Suburban             Inner
           Male   Female minority Minority  missing   fornia   fornia   Course  Courses  or less  or more   /Rural    Urban     City      ALL

TOTALS3: Total for Scenario 3, 1 Rater
MIN       16.00    12.00    16.00    12.00    30.00    20.00    12.00    16.00    12.00    12.00    16.00    16.00    12.00    19.00    12.00
MAX       33.00    32.00    33.00    32.00    30.00    30.00    33.00    32.00    33.00    33.00    31.00    33.00    30.00    28.00    33.00

PCTS3: Pct. for Scenario 3, 1 Rater
MEAN      75.00    68.34    68.76    69.71    85.71    70.00    68.21    72.38    66.34    67.21    70.86    70.86    66.07    70.86    68.99
STD       21.17    13.73    13.11    18.62        .    10.47    15.25    13.51    14.79    16.80    11.49    16.02    13.68     9.77    14.39
N             4       37       31       10        1        8       32       18       23       21       20       20       16        5       41
MIN       45.71    34.29    45.71    34.29    85.71    57.14    34.29    45.71    34.29    34.29    45.71    45.71    34.29    54.29    34.29
MAX       94.29    91.43    94.29    91.43    85.71    85.71    94.29    91.43    94.29    94.29    88.57    94.29    85.71    80.00    94.29

TOTS_S: Total LAPKA Score
MEAN      86.75    84.47    85.81    81.44    101.0    83.62    84.44    86.94    82.74    81.06    88.39    84.50    82.29    94.25    84.72
STD       11.09    11.85    11.86    10.92        .    13.38    11.11    12.30    10.96    10.90    11.47    11.58    12.21     4.99    11.64
N             4       32       27        9        1        8       27       17       19       18       18       18       14        4       36
MIN       74.00    63.00    63.00    66.00    101.0    63.00    66.00    66.00    63.00    68.00    63.00    63.00    66.00    90.00    63.00



(CONTINUED)



















          Candidate's       Candidate's           Candidate's            Candidate's       Candidate's           Candidate's
            gender             race                education             preparation          grade                location
                                                        Non-
                             Non-              Data    Cali-    Cali-     <= 1     >= 2  Grade 3  Grade 4 Suburban             Inner
           Male   Female minority Minority  missing   fornia   fornia   Course  Courses  or less  or more   /Rural    Urban     City      ALL

TOTS_S: Total LAPKA Score
MAX       101.0    105.0    105.0    97.00    101.0    101.0    105.0    105.0    105.0    105.0    105.0    105.0    105.0    101.0    105.0

PCTS_S: Pct. LAPKA Score
MEAN      69.40    62.15    63.45    61.46    80.80    62.15    62.53    64.97    61.15    60.46    65.45    64.30    60.38    65.91    62.96
STD        8.87     8.99     9.32     8.95        .     6.80     9.35    10.67     7.36    10.24     7.35     9.95     8.97     3.49     9.14
N             4       32       27        9        1        8       27       17       19       18       18       18       14        4       36
MIN       59.20    47.55    47.55    49.65    80.80    50.40    47.55    47.55    47.55    47.55    50.40    47.55    47.55    62.94    47.55
MAX       80.80    84.00    84.00    77.60    80.80    70.63    84.00    84.00    73.43    84.00    80.80    84.00    77.60    70.63    84.00

PCTAVS: Ave. Pct. across Scenarios
MEAN      76.85    66.68    67.52    68.70    83.75    67.25    67.39    69.45    66.35    65.98    69.65    70.39    64.67    67.20    67.81
STD        6.37     9.51     9.72    10.18        .     4.54    10.54    11.49     7.81    11.94     6.64     9.96     9.99     4.24     9.70
N             4       32       27        9        1        8       27       17       19       18       18       18       14        4       36
MIN       68.90    43.48    43.48    53.80    83.75    61.53    43.48    43.48    53.80    43.48    59.76    55.2?    43.48    63.59    43.48
MAX       83.75    90.23    90.23    82.13    83.75    73.49    90.23    90.23    81.29    90.23    83.75    90.23    82.13    72.30    90.23
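
The subgroup statistics reported above (N, MIN, MAX, MEAN, STD) and the percent scores can be illustrated with a short sketch. This is not the project's SAS program, which is not reproduced in this report; it is a minimal Python illustration under stated assumptions. The function names `describe` and `percent_of_max` are hypothetical, the example totals are made up, and the Scenario 3 maximum of 35 points is an assumption inferred from the reported pairs (a total of 12 corresponds to 34.29 percent and a total of 33 to 94.29 percent). The standard deviation is taken with an n - 1 denominator, and it is left undefined for a subgroup of one candidate, which the tables print as ".".

```python
from statistics import mean, stdev


def describe(scores):
    """Return N, MIN, MAX, MEAN and STD for one subgroup of scores."""
    n = len(scores)
    return {
        "N": n,
        "MIN": min(scores),
        "MAX": max(scores),
        "MEAN": round(mean(scores), 2),
        # Sample standard deviation (n - 1 denominator). It is undefined
        # for a single candidate, shown as "." in the tables.
        "STD": round(stdev(scores), 2) if n > 1 else None,
    }


def percent_of_max(total, max_points):
    """Express a raw total as a percentage of the maximum possible points."""
    return round(100.0 * total / max_points, 2)


# Assumption: 35 is inferred from the reported totals and percentages,
# not stated in the report.
SCENARIO3_MAX = 35

# Hypothetical totals for one subgroup (illustrative only, not report data).
example_totals = [12, 24, 30, 33]

print(describe(example_totals))
print(describe([percent_of_max(t, SCENARIO3_MAX) for t in example_totals]))
```

Computing the percent score for each candidate first and then summarizing gives the same MIN and MAX as converting the raw MIN and MAX, provided every candidate shares the same maximum. Where candidates have different maximum possible points, the percent statistics have to be computed from the individual percent scores.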






APPENDIX F:



STATISTICAL COMPARISON OF TEACHER PERFORMANCE ON THE
SECONDARY ENGLISH ASSESSMENT




13:00 Sunday, October 14, 1990 1 



CTC PILOT TEST ANALYSES: SFSU DATA 
Program: F:\DATA\SAS\PROGRAMS\SFSU_F.SAS 
Input Data: F:\DATA\SAS\DATASETS\SFSU_DAT.SSD
Output File: F:\DATA\SAS\OUTPUT\SFSU_F.OUT 

Analysis F: Descriptive Statistics on Subtest and Major Test Ratings Summed Across Raters 



          Candidate's       Candidate's                Candidate's                Candidate's                Candidate's
            gender             race                    preparation                   grade                    location
                             Non-                     0 or 1        2        3     High  Middle/                               Inner
           Male   Female Minority Minority  Missing  Courses  Courses  Courses   School Jr. High  Missing Suburban    Urban     City      ALL



SA1_RTS: 


MEAN 


1 


5, 


33 | 


6. 


08 | 


5. 


94 




5, 


33 | 


5.00| 


5 


29 | 


6. 


14 | 


6. 


50 | 


5.92 | 


5 


71| 


6 


00 | 


6.50| 


b 


4 0 | 


b, 








Overall 
response 


STD 


1 


0. 


82 | 


1 . 


12| 


1 . 


12 




0 


58 | 


• 1 


1 


n| 


0. 


69 | 


1 . 


29 | 


0.79| 


1 


50 | 




•1 


1 . 05 | 


0 


89 | 


1 . 


13 | 


i. 


0/ 


strategies 


N 


1 




6 I 




»| 




16| 




3 | 


i| 




?| 








<l 


12| 




?| 




i| 


*l 




5| 




7 | 




19 




MIN 


1 




<l 




<l 




<l 




5| 


5| 




<l 










5| 




<l 










«l 




4 | 




4 




MAX 


1 




•I 




8| 




8| 




6 | 


5| 




?| 








8| 


?l 




8| 




*l 


8| 




6| 




7 | 




8 


SA2RTS: 


MEAN 


1 


4 . 




6. 


50 | 


6. 


10| 


5 


.00 | 


•1 


5 


. 4 0 | 


6. 


00 | 


6. 


67| 


6 . 12 | 


5 


bO | 


6 


00 | 


6 . 80 | 


b 


bO | 










Overal 1 

analysis of wrt 


STD 


1 


0. 


96| 


0. 


76| 


1 . 


10| 


1 


41 | 


•1 


1 


34 | 


0. 


82 | 


1. 




1 . 25 | 


1 


00 | 




• 1 


0.84 | 


0 




1 . 


lb| 


l . 


16 


and txt 


N 


1 




<l 




8| 




10| 




2| 


o| 




5| 




<l 




3| 


8| 




M 




ll 


5| 




2| 




4 | 




\2 




MIN 


1 




<l 




6 I 




<l 




4 | 


•1 




<l 








6| 


<l 




<l 




<H 






5| 




4 | 




4 




MAX 


1 




•I 




8| 




8| 




6 | 


•1 












8| 


8| 




6 I 




61 


8| 




6| 




6| 




8 


MA RTS: Jveral) 


MEAN 


1 


5. 


17| 


5. 




5, 


69 | 


5 


00 | 


5 00 | 


5 


.14| 


5. 


86 | 


6. 


oo| 


5 , 83 | 


5 


.14| 


6 


00 | 


6. 00 | 


b 




\i « 


4 3 | 






rat ing, Form A 


STD 


1 


0. 


751 


1, 


oil 


0. 


95 | 


1 


00 | 


•1 


0 


90 | 


0 ♦ 


38| 


1 . 


63| 


0 . 8 J | 


1 


07 | 




• 1 


l . 7 r, | 


0 


84 | 


0. 


;y | 


0. 


96 



(CONTINUED)







          Candidate's       Candidate's                Candidate's                Candidate's                Candidate's
            gender             race                    preparation                   grade                    location
                             Non-                     0 or 1        2        3     High  Middle/                               Inner
           Male   Female Minority Minority  Missing  Courses  Courses  Courses   School Jr. High  Missing Suburban    Urban     City      ALL



MA RTS: Overall 


N 


l 




6| 


»l 




16| 




3| 


1| 






7| 




4| 




12\ 


7 | 


i| 


















rating, Form A 


MIN 


1 




4| 


4| 




«l 




4| 


5| 


4| 




M 




4| 






4 | 






4| 




4| 




4| 








MAX 


I 




•I 


■1 




■1 




6| 


5| 


•I 




6| 




8| 




■1 


6 | 


•I 




■1 




•I 








8 


SB 1 RTS: 


MEAN 


l 


6 


.67| 


6.92 | 


6. 


69 | 


7 


. 67 | 6 


00 | 


6.57 | 


7 


.14| 


7 


00 | 


/. 


33 | 


6.00| 8 


00 | 


















Overall 
interpretive 






1 


• 03 | 


1. 38 j 


1 . 


30| 


0 


.58| 


.1 


1.51 | 


1 


.07| 


1 


41| 


0. 


98) 


1. 29 | 


• 1 


1 


.21 | 


1 


67 | 


u 


.98 | 


1 


.26 


process 


N 


I 




•I 


13 | 




16| 




3| 


M 


7| 




7| 




4| 




12| 


7 | 


11 




6 I 




5| 




7| 




19 




MIN 


1 




•I 


4| 




4| 




7| 


«l 


4| 




•I 




5| 




•I 


4 | 


B| 




5| 




4| 












MAX 


l 




■1 


■1 




»l 




8| 


• I 


■1 




«l 




■1 




■1 


8 | 


8, 




■1 




■1 








8 


SB2 RTS : 


MEAN 


l 


6 


- 50 | 


6.85 | 


6 


69 | 


7 


. 00 | 6 


, 00 | 


6.86 | 


6 


.71| 


6 




6. 


92 | 


6.43| 7 


00 | 


















Overall group 
process 


STD 


l 


0 


. 84 | 


0.99| 


0 


95 | 


1 


- 00 | 


• 1 


1 .07 | 


1 


.U| 


0 


50 | 


t. 


00 | 


0.79| 


• 1 


0 


.I2| 


0 


89 | 


0 


9b | 


0 


9 J 




N 


l 




•I 


"I 




16| 




3| 


ll 


7| 




7| 




4! 




12 1 


7 | 


ll 




f 'l 




5| 




7| 




19 




MIN 


1 




• I 


5| 




5| 




6| 


6| 


6| 




5| 




«l 




5| 


6| 


7| 
















b 



(CONTINUED)









          Candidate's       Candidate's                Candidate's                Candidate's                Candidate's
            gender             race                    preparation                   grade                    location
                             Non-                     0 or 1        2        3     High  Middle/                               Inner
           Male   Female Minority Minority  Missing  Courses  Courses  Courses   School Jr. High  Missing Suburban    Urban     City      ALL


SB2_RTS: Overall group process
MAX           8        8        8        8        6        8        8        7        8        8        7        8        8        8        8


MB_RTS: Overall rating, Form B
MEAN       6.67     7.08     6.80     7.67     7.00     6.86     6.83     7.25     7.27     6.43     8.00     7.40     6.80     6.57     6.94
STD        1.03     1.08     1.08     0.58        .     1.07     1.33     0.96     1.10     0.79        .     0.89     1.10     1.13     1.06
N             6       12       15        3        1        7        6        4       11        7        1        5
MIN           6        5        5        7        7        6        5        6        5        6        8        6
MAX           8        8        8        8        7        8        8        8        8        8



SC1RTS: 
Overall content 



MEAN 

STD 

N 

MIN 
MAX 



■ + + + + 

I 1.471 1 . 56 | 1 . 61 1 0. 58 | 

- + + + + ♦■ 



I 



*\ 



■I 



13 I "I 
--- + + - 

4| 4| 

•-- + 

8| B| 



3| 



1| 7| 



6| 



4| 4| 



7| 



8| 



50 1 0 


96 1 1 


30 1 1 


70| 


•1 1 


33 | 


1 . 1 0 | 1 


86 1 1 


49 


f\ 


4| 


12| 


7| 


1| 


6 | 


5| 




19 


4| 


6| 


*\ 


*\ 


7| 


5 | 


5| 


•\ 


4 


B| 


B| 


B| 


■1 


7| 


8 | 


8| 




8 



(CONTINUED) 









          Candidate's       Candidate's                Candidate's                Candidate's                Candidate's
            gender             race                    preparation                   grade                    location
                             Non-                     0 or 1        2        3     High  Middle/                               Inner
           Male   Female Minority Minority  Missing  Courses  Courses  Courses   School Jr. High  Missing Suburban    Urban     City      ALL



SC2RTS: 


MEAN 


1 


6. 


00 | 


6. 


08 | 


6. 


13| 


5. 


67| 


4 


00 | 


5, 


86 | 


6. 


17| 


6 


75| 


6. 


27| 


5. 


71 1 


6 


. 00 | 


6. 


83| 












Overall plan 




1 


0. 


71| 


1. 


50 | 


1. 


<1| 


0. 


58 | 




• 1 


1, 


21| 


1. 


17| 


1, 


50 | 


1. 


19 | 


1 . 


50 | 




• 1 


1. 


33 | 


l 


29 | 


1 


.25| 


1.30 




N 


1 




5| 




13 1 




15| 




3| 




i| 




■>\ 




«l 




4| 




Hi 




'1 




i| 




6| 




4| 




7| 


18 




MIN 






5| 




4| 




<l 




5| 




<l 




4| 




4| 




5| 




<l 




4| 








5| 




4| 




4| 








1 




''I 




8| 




8| 




«l 




4| 




8| 




?l 




8| 




8| 




8| 




«l 




8| 




7| 




7| 


8 


SC3RTS: 


MEAN 


1 


6. 


50 | 


6. 




6. 


75| 


6. 


33 | 


6 


. 00 | 


6 


86 | 


6. 


29 | 


7 




6. 


92 | 


6. 


29 | 


8 


. 00 | 


/. 


"1 


6 


80 | 








Overall 
delivery 


STD 


1 


0. 


84 | 


1. 


09 | 


0. 


93| 


1. 


53| 




•! 


0 


90 | 


0. 


76 | 


1 


50 | 


0 . 


90 | 


1. 


Hi 




• 1 


1. 


17| 


0 


84| 


0 


.58| 


1.00 




N 


1 




6| 




13| 




16| 




3| 




ll 




7| 




■>\ 




4| 




12 1 




7| 




11 








5| 




1\ 


19 




MIN 


1 




«l 




5| 




5| 




5| 








6| 




5| 




5| 




•I 




5| 




8| 




5| 




6| 




5| 


5 




MAX 


1 




8| 




8| 




8| 




8| 




6| 




8| 




7| 




8| 




8| 




8| 




8| 




8! 




8| 




1\ 


8 


MC RTS: Overall 


MEAN 


1 


6. 




6. 


08 | 


6. 


19| 


5. 


67 | 


4 


. 00 | 


5 


.71| 


6, 


43| 


6 


75| 


6. 


33 | 


b. 


71 | 


6 


. 00 | 


6. 




f> 


.B0| 








rat: i ng, Form C 


STD 


1 


1 . 


17 | 


1. 


50 | 


1. 


47| 


0. 


58 | 




• 1 


1 


.25| 


1 . 


27| 


1 


50 | 


1 . 


15| 


1. 


70 | 




• 1 


1. 


nj 


1 


.10| 


1 


. 50 | 


1.37 






(CONTINUED) 









          Candidate's       Candidate's                Candidate's                Candidate's                Candidate's
            gender             race                    preparation                   grade                    location
                             Non-                     0 or 1        2        3     High  Middle/                               Inner
           Male   Female Minority Minority  Missing  Courses  Courses  Courses   School Jr. High  Missing Suburban    Urban     City      ALL



MC RTS: Overall 
rating, Form C 



N 


1 


•I 


13| 


16| 


3| 


M 


7| 


7| 


«l 


12 1 


7| 


1 


•I 


5| 


7| 


19 


MIN 


1 


5| 


4| 


<l 


51 


«l 


*l 


«l 


5| 


«l 


4| 


6 




«l 


«l 


4 


MAX 


1 


81 


8| 


■1 


6| 


4| 


•1 


8| 


8| 


•1 


■1 


6 


1 81 


n 


8| 


8 






